Kennedy-Lab-UW / Duplex-Sequencing


reduce speedup: list() 'in' is O(n), set() is O(1) #41

Closed aogier closed 7 years ago

aogier commented 8 years ago

Hi, I noticed a long-running tag_to_header process on a bioinformatics cluster I manage. After a little investigation, it turned out that beyond a certain number of sequences, when using --tagstats and --reduce, the script suddenly becomes heavily CPU-bound. I think I've found the culprit and would like to share the fix with you.
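A minimal sketch of the pattern behind this kind of fix (the function names here are illustrative, not taken from the actual script): testing membership in a growing `list` scans every element, making the loop quadratic overall, while a `set` does a hash lookup per check.

```python
def unique_tags_list(tags):
    """De-duplicate while preserving order, using a list (slow)."""
    seen = []
    for tag in tags:
        if tag not in seen:   # O(len(seen)) scan on every check
            seen.append(tag)
    return seen


def unique_tags_set(tags):
    """Same result, but membership checks use a set (fast)."""
    seen = set()
    out = []
    for tag in tags:
        if tag not in seen:   # O(1) average-case hash lookup
            seen.add(tag)
            out.append(tag)
    return out
```

Both return the same result; only the membership-test cost changes, which is why the slowdown only appears beyond a certain input size.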

Some rough measurements I've done:

| Million sequences | vanilla | modified |
|-------------------|---------|----------|
| 1                 | 35''    | 35''     |
| 2                 | 163''   | 60''     |
| 3                 | 766''   | 93''     |
| 4                 | 2186''  | 118''    |

We are working on an input dataset of ~143M sequences, so you can imagine the impact :)

Python wiki page explaining the time complexity of the `in` operator: https://wiki.python.org/moin/TimeComplexity

HTH, regards

scottrk commented 8 years ago

Hi, I would strongly suggest that you not use the --reduce option, as there's something not quite right with how it works; it was experimental when I put it in, and I'm not really supporting it. In fact, I would encourage you to try the UnifiedConsensusCaller.py that I recently posted to GitHub. It is a single program that takes an unaligned BAM file (which I typically create with Picard FastqToSam) and outputs final duplex consensuses (and single-strand consensuses, if desired) in paired gzipped FASTQ files, along with the relevant tagstats plots. It saves a huge amount of time and avoids any biases that could result from performing an alignment first.

aogier commented 8 years ago

Hi, many thanks for your kind feedback, which I've forwarded to our scientists!

Have a nice day, regards