FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
469 stars 151 forks source link

Multithreaded support from Cutadapt in trim_galore #16

Closed ajphipps closed 6 years ago

ajphipps commented 6 years ago

Dear Felix,

I really like this piece of software and have been testing it for my research, but I was wondering how difficult it would be to implement multithreaded support that is available in basic cutadapt? Is it possible to pass the option or arguement from trim_galore to cutadapt?

My current loop I am using is as follows: for file in *all.fastq ; do trim_galore -fastqc_args "-t 60" -o /NGS_Data/Andrew/test/test2/ ${file} done

Thanks again, Andrew

ajphipps commented 6 years ago

Sorry just saw the closed issues with questions of multithreaded support - sorry for opening the new issue!

Another thought I had was it might be possible to use GNU parallel to run multiple instances of trim_galore if you have many samples.

FelixKrueger commented 6 years ago

Hi Andrew, thanks for your comments. I can see that it would be 'nice-to-have' from a conceptual point of view to have trimming that is faster. We ourselves are typically dealing with the situation that we need to process hundreds (or even thousands) of samples in parallel (and have some 700 cores available to do so) rather than having a single sample which we would like to process 700 times faster. I suppose having a cluster readily available doesn't exactly create an incentive to work on the speed of trimming. But if demand rises further - I might reconsider...! Cheers, Felix

ibwoo commented 6 years ago

I'd be very interested in being able to use the new multi-threading features of cutadapt via TrimGalore. Please consider this a +1 for demand!

FelixKrueger commented 6 years ago

Noted, thanks.

ncimb commented 6 years ago

+1 for this feature, google brought me here with the same request in mind!

dpryan79 commented 6 years ago

If some nice person were to create a PR implementing this would you consider merging it?

FelixKrueger commented 6 years ago

I would definitely consider that. I suppose we, as in 'the nice person', would have to make sure that it still runs fine on machines that don't have Python 3 installed etc though.

While we are at it, did you manage to finish the MultiQC module for SNPsplit reports? Maybe I could find some time to get that done at some point? Cheers, Felix

dpryan79 commented 6 years ago

While we are at it, did you manage to finish the MultiQC module for SNPsplit reports? Maybe I could find some time to get that done at some point? Cheers, Felix

[Devon looks away embarrassed] Right, that, it's still on my to do list

FelixKrueger commented 6 years ago

Don't be, it was super cool of you to make a start on it anyways. I was considering to try out how to add something to MultiQC for some time, so I could try to take a look at this if you didn't get round to it.

But regarding Trim Galore, I assume getting the multicore trimming to work would require a check for the version of python that is currently running, as well as whether the parallel zipping is installed. It might get more icky for downstream tests that are performed in Trim Galore itself, such as the length validation, N-trimming etc, which all require the reads to be present in a synchronised fashion between R1 and R2...

dpryan79 commented 6 years ago

Yeah, it'll require some testing before I would want to make a PR, though I can tell you that cutadapt returns R1 and R2 synchronized, since it trims both at once.

Feel free to play around with my current SNPsplit PR and improve it. We're not using it heavily at the moment so there's not a lot of motivation.

FelixKrueger commented 6 years ago

Yeah, it'll require some testing before I would want to make a PR, though I can tell you that cutadapt returns R1 and R2 synchronized, since it trims both at once.

That is good because it will make a lot of things easier. It will mean though that we need to move the paired-end validation testing from two consecutive trimming processes into this new trimming mode. This might be a bit of work but was probably overdue anyway.... But yea it would be great if you wanted to start on that, I'm of course happy to assist if possible at all.

Feel free to play around with my current SNPsplit PR and improve it. We're not using it heavily at the moment so there's not a lot of motivation.

Same here, but I am always looking for things to broaden my horizon...

dpryan79 commented 6 years ago

Upon further inspection it looks like implementing this would require an almost complete rewrite of how Trim Galore is internally structured, which is rather more involved than I have time for at the moment. For us the most useful feature of Trim Galore is the automatic detection of the most appropriate adapter sequence, I'll instead check and see if I can implement that in cutadapt directly.

FelixKrueger commented 6 years ago

That's pretty much the same conclusion I came to. Given that we never really are in need of having to speed up any single trimming event (as we can run hundreds of them in parallel), adding multithread support didn't seem to be justified, or at least high up on a priority list.

Thanks for looking anyway. Cheers, Felix

wolfgangrumpf commented 5 years ago

Another vote for making Trim Galore/Cutadapt multi-threaded, if possible.....

FelixKrueger commented 5 years ago

Thanks for the vote Wolfgang. In my preliminary tests (see issue #39) I found that multi-core support doesn't exactly look worthwhile (certainly not specifying beyond 2 cores...):

impact

Maybe indeed splitting the input into junks would be another option to speed things up?

fjames003 commented 5 years ago

@FelixKrueger, were these tests ran with pigz installed? In my tests, running Trim-galore with more than two cores (using code from #39) provided a speed up compared to two cores alone. Maybe you are be limited by how fast gzip can process the fastq files?

fjames003 commented 5 years ago

@FelixKrueger, I provided some benchmarks from my system, as well as updated the code to do some checks. Please check out the updates in #39.

FelixKrueger commented 5 years ago

No, these tests were run without pigz installed. I'll make sure to give #39 a try soon. Thanks for your work on this.