Closed ajphipps closed 6 years ago
Sorry just saw the closed issues with questions of multithreaded support - sorry for opening the new issue!
Another thought I had was it might be possible to use GNU parallel to run multiple instances of trim_galore if you have many samples.
Hi Andrew, thanks for your comments. I can see that it would be 'nice-to-have' from a conceptual point of view to have trimming that is faster. We ourselves are typically dealing with the situation that we need to process hundreds (or even thousands) of samples in parallel (and have some 700 cores available to do so) rather than having a single sample which we would like to process 700 times faster. I suppose having a cluster readily available doesn't exactly create an incentive to work on the speed of trimming. But if demand rises further - I might reconsider...! Cheers, Felix
I'd be very interested in being able to use the new multi-threading features of cutadapt via TrimGalore. Please consider this a +1 for demand!
Noted, thanks.
+1 for this feature, google brought me here with the same request in mind!
If some nice person were to create a PR implementing this would you consider merging it?
I would definitely consider that. I suppose we, as in 'the nice person', would have to make sure that it still runs fine on machines that don't have Python 3 installed etc though.
While we are at it, did you manage to finish the MultiQC module for SNPsplit reports? Maybe I could find some time to get that done at some point? Cheers, Felix
While we are at it, did you manage to finish the MultiQC module for SNPsplit reports? Maybe I could find some time to get that done at some point? Cheers, Felix
[Devon looks away embarrassed] Right, that, it's still on my to do list
Don't be, it was super cool of you to make a start on it anyways. I was considering to try out how to add something to MultiQC for some time, so I could try to take a look at this if you didn't get round to it.
But regarding Trim Galore, I assume getting the multicore trimming to work would require a check for the version of python that is currently running, as well as whether the parallel zipping is installed. It might get more icky for downstream tests that are performed in Trim Galore itself, such as the length validation, N-trimming etc, which all require the reads to be present in a synchronised fashion between R1 and R2...
Yeah, it'll require some testing before I would want to make a PR, though I can tell you that cutadapt returns R1 and R2 synchronized, since it trims both at once.
Feel free to play around with my current SNPsplit PR and improve it. We're not using it heavily at the moment so there's not a lot of motivation.
Yeah, it'll require some testing before I would want to make a PR, though I can tell you that cutadapt returns R1 and R2 synchronized, since it trims both at once.
That is good because it will make a lot of things easier. It will mean though that we need to move the paired-end validation testing from two consecutive trimming processes into this new trimming mode. This might be a bit of work but was probably overdue anyway.... But yea it would be great if you wanted to start on that, I'm of course happy to assist if possible at all.
Feel free to play around with my current SNPsplit PR and improve it. We're not using it heavily at the moment so there's not a lot of motivation.
Same here, but I am always looking for things to broaden my horizon...
Upon further inspection it looks like implementing this would require an almost complete rewrite of how Trim Galore is internally structured, which is rather more involved than I have time for at the moment. For us the most useful feature of Trim Galore is the automatic detection of the most appropriate adapter sequence, I'll instead check and see if I can implement that in cutadapt
directly.
That's pretty much the same conclusion I came to. Given that we never really are in need of having to speed up any single trimming event (as we can run hundreds of them in parallel), adding multithread support didn't seem to be justified, or at least high up on a priority list.
Thanks for looking anyway. Cheers, Felix
Another vote for making Trim Galore/Cutadapt multi-threaded, if possible.....
Thanks for the vote Wolfgang. In my preliminary tests (see issue #39) I found that multi-core support doesn't exactly look worthwhile (certainly not specifying beyond 2 cores...):
Maybe indeed splitting the input into junks would be another option to speed things up?
@FelixKrueger, were these tests ran with pigz installed? In my tests, running Trim-galore with more than two cores (using code from #39) provided a speed up compared to two cores alone. Maybe you are be limited by how fast gzip can process the fastq files?
@FelixKrueger, I provided some benchmarks from my system, as well as updated the code to do some checks. Please check out the updates in #39.
No, these tests were run without pigz
installed. I'll make sure to give #39 a try soon. Thanks for your work on this.
Dear Felix,
I really like this piece of software and have been testing it for my research, but I was wondering how difficult it would be to implement multithreaded support that is available in basic cutadapt? Is it possible to pass the option or arguement from trim_galore to cutadapt?
My current loop I am using is as follows: for file in *all.fastq ; do trim_galore -fastqc_args "-t 60" -o /NGS_Data/Andrew/test/test2/ ${file} done
Thanks again, Andrew