alexstaj / cutadapt

Automatically exported from code.google.com/p/cutadapt

Add multi-threading and a suggested way to increase speed #44

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
This is a request for feature enhancement, not a defect.

Would it be possible to add multithreading as an option?

I'm running against a conf file of 25 adapters with -b and -n 2, on about 7M
50 bp reads, and it takes a few hours. Not a big deal, just wondering what
multithreading would do for this. I'm using version 1.0.

Another thought. I've not looked at the source code (Python and C are not my
strong suits), so you may already be doing something like what I am about to
suggest:
if no quality-related options are selected (e.g., -q, --quality-base), then
how about keeping only the unique reads, processing those, and then applying
the same result to the duplicate reads? When there is a lot of contamination (and
therefore many duplicate reads), this may speed things up. I realize there are tools
made just for that purpose, but I thought I would throw it out there.
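The suggestion above can be sketched in a few lines: trim each distinct sequence once and reuse the cached result for every duplicate. This only works when trimming depends on the sequence alone (no quality-based options in effect). `trim_read` here is a hypothetical stand-in for cutadapt's real per-read trimming, not its actual API:

```python
def trim_read(seq, adapters):
    """Hypothetical stand-in for adapter trimming: cut the read at the
    first occurrence of any adapter."""
    for adapter in adapters:
        pos = seq.find(adapter)
        if pos != -1:
            return seq[:pos]
    return seq

def trim_all_dedup(seqs, adapters):
    """Trim each distinct sequence once; duplicates reuse the cached result."""
    cache = {}  # sequence -> trimmed result
    out = []
    for seq in seqs:
        if seq not in cache:
            cache[seq] = trim_read(seq, adapters)
        out.append(cache[seq])
    return out
```

With 7M reads and heavy contamination, the dictionary holds one entry per distinct sequence; whether that fits in memory depends on how diverse the library is.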

Thanks for this handy tool!

Original issue reported on code.google.com by jwad...@gmail.com on 8 May 2012 at 9:36

GoogleCodeExporter commented 9 years ago
Python’s a nice language, you should try it :)

Multithreading is on my to-do list, and there are also some algorithmic
improvements possible that would speed up the single-threaded case. I have to say,
though, that my time to work on this is quite limited right now.

Regarding duplicate reads: the problem is detecting whether a read is a
duplicate of an earlier one. Since the first read could be a duplicate of the last
read, all reads and their results would need to be kept in memory. Keeping only the
last 1000 or so reads in memory would perhaps be an option and might already help
a bit, but I think there are other potential improvements that would help even
more (parsing the input, for example).

Original comment by marcel.m...@tu-dortmund.de on 10 May 2012 at 6:54

GoogleCodeExporter commented 9 years ago
I have recently done some experiments trying to get multithreading into
cutadapt, but there is a really large overhead that comes from the communication
between the threads. In the end, the multithreaded version was actually slower
than the non-threaded one. One more idea would be to provide a wrapper script
that splits the input FASTQ file and then runs multiple cutadapt instances, but
I guess something like this exists already. For now, I've decided not to work
on this further.
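The wrapper idea (split the input, process the pieces independently, concatenate the results) can be sketched with `multiprocessing`. In this sketch a stand-in trim function replaces launching real cutadapt processes on temporary FASTQ slices, so only the split/process/merge pattern is shown; all names here are illustrative:

```python
from multiprocessing import Pool

def _trim_chunk(chunk):
    """Stand-in worker. In a real wrapper this would write the chunk to a
    temporary FASTQ file and run one cutadapt process on it."""
    adapter = "AGATCGGAAGAGC"  # example adapter, for the sketch only
    out = []
    for seq in chunk:
        pos = seq.find(adapter)
        out.append(seq[:pos] if pos != -1 else seq)
    return out

def trim_parallel(seqs, workers=4):
    """Split reads into contiguous chunks, trim them in parallel, and
    merge in order. Reads are independent, so keeping the chunks in
    their original order preserves the output order."""
    n = max(1, len(seqs) // workers)
    chunks = [seqs[i:i + n] for i in range(0, len(seqs), n)]
    with Pool(workers) as pool:
        results = pool.map(_trim_chunk, chunks)
    return [seq for chunk in results for seq in chunk]
```

Because each chunk is processed by a separate process rather than a thread, this pattern sidesteps the inter-thread communication overhead described above, at the cost of splitting and re-merging the files.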

Original comment by marcel.m...@tu-dortmund.de on 19 Jun 2014 at 2:10