alk224 opened this issue 9 years ago
Thanks for sharing the code @alk224. Agree - we're planning a complete re-write of this for QIIME 2, and it will likely be parallelized at that time. I'll assign this to the QIIME 2 milestone.
@alk224, just to note, I believe the use of split assumes the input data are single-line fasta/fastq and that the sequence data are not spread over multiple lines. I may be wrong though.
Yes, split is totally inappropriate here, but I don't really know any better. I think awk is the appropriate command to use, as it would enable proper naming of the output files. The way I used split, it breaks the file up by line count alone, so this script will crash if you choose the wrong number of cores and wind up with a broken fastq record somewhere (or in several places).
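For illustration, a record-aware split could look something like this sketch, which only ever cuts on 4-line fastq record boundaries (the function and file names here are illustrative, not from any script in this thread):

```python
import math
from pathlib import Path

def split_fastq(in_path, n_chunks, out_dir):
    """Split a fastq file into up to n_chunks pieces, cutting only on
    4-line record boundaries so no record is ever broken."""
    lines = Path(in_path).read_text().splitlines(keepends=True)
    if len(lines) % 4 != 0:
        raise ValueError("truncated fastq record in input")
    n_records = len(lines) // 4
    per_chunk = math.ceil(n_records / n_chunks)
    out_paths = []
    for i in range(n_chunks):
        # Slice whole records (multiples of 4 lines) per chunk.
        chunk = lines[i * per_chunk * 4:(i + 1) * per_chunk * 4]
        if not chunk:
            break
        out = Path(out_dir) / f"chunk_{i:03d}.fastq"
        out.write_text("".join(chunk))
        out_paths.append(out)
    return out_paths
```

Unlike `split -l`, this can never leave a partial record in any chunk, so a downstream per-chunk job cannot crash on a half record.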
Thanks @alk224, agree this will be useful to address in QIIME 2!
:+1: This is very much needed.
It just took me 30 minutes to join and 45 minutes to split a single MiSeq run. I shudder to think what a HiSeq run takes.
@colinbrislawn I may have a solution in the near future using a dependency called fasta-splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/), but I still need to add a step that modifies the sequence IDs in each split_libraries output. Otherwise, you will have multiple sequences with identical IDs (one duplicate set for every separate core you utilize).
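The deduplication step could be as simple as renumbering headers while merging the per-chunk outputs. A minimal sketch, assuming QIIME-style headers like `>SampleA_0 ...` (the function name and numbering scheme are my own, not from alk224's script):

```python
def renumber_fasta(chunk_paths, out_path):
    """Merge split_libraries fasta chunks, rewriting the trailing
    per-sample sequence number so IDs stay unique after merging."""
    counter = 0
    with open(out_path, "w") as out:
        for path in chunk_paths:
            with open(path) as fh:
                for line in fh:
                    if line.startswith(">"):
                        head = line[1:].split()
                        sample = head[0].rsplit("_", 1)[0]
                        rest = " ".join(head[1:])
                        out.write(f">{sample}_{counter} {rest}".rstrip() + "\n")
                        counter += 1
                    else:
                        out.write(line)
```

Since each chunk restarts its numbering at 0, a single global counter during the merge is enough to make every ID unique again.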
I forgot I deleted that old repository. That script now lives here. I will probably rewrite the whole thing soon so it is less idiotic.
https://github.com/alk224/akutils-dev/blob/master/dev-split_libraries_parallel.sh
+1 for a faster split_libraries and split_libraries_fastq.
I wanted some extra speed and I ended up writing something similar to @alk224, based on the idea of splitting big fastq files, batch processing them in parallel, and recombining the results. You can check it out here under parallel_split_libraries_fastq. I am able to get a MiSeq-run split_libraries_fastq job to go from hours/overnight to under 10 minutes using a machine with 30 cores.
As we still have a little bit of 454 data in the lab, I have also taken a stab at making split_libraries.py parallel. My work on that is in this branch. There I tried to refactor the main work function, check_seq. I represented each sequence as a dictionary of values and have lots of small functions act on the dictionaries, thereby divorcing the sequence/barcode/testing info from the loop. Quality test? Put it in the dict. Barcode test? Put it in the dict. I can then pass the values around, have lots of functions act on them, and tally up the results later. That style lets you hit the stream of sequence dictionaries with multiprocessing and fan your work out.
So I got a decent speed-up on a test set using extra cores: 16 minutes (1 core) down to 10 minutes (3 cores). The bottleneck, however, is still the fact that you need to load and look up quality information for each sequence. That is tricky to parallelize for the following reason: multiprocessing Processes share no global variables. You either have to give each Process its own copy of the reference data, which is fine when the data are small but not when they are a multi-GB dictionary, or you need some sort of external database that each Process can query independently. At that point it wasn't worth my time to implement the quality dictionary as a database, but I bet it could be done and made very fast.
In short, I think adopting a functional style could allow for big speedups in both scripts, and I would be happy to work on this a little more if it is useful or could lead to inclusion. If not, I have my workarounds/speedups, and it was a good lesson in Python parallelization.
thanks, zach cp
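The dict-per-sequence style described above can be sketched in a few lines. This is a toy illustration of the pattern, not code from zachcp's branch; the test functions and thresholds are made up:

```python
from multiprocessing import Pool

def make_record(seq_id, seq, qual):
    # Represent each sequence as a plain dict so independent test
    # functions can annotate it without touching the main loop.
    return {"id": seq_id, "seq": seq, "qual": qual}

def quality_test(rec, min_q=20):
    # Quality test? Put the verdict in the dict.
    rec["qual_ok"] = min(rec["qual"]) >= min_q
    return rec

def barcode_test(rec, barcodes=("ACGT",)):
    # Barcode test? Put the verdict in the dict.
    rec["barcode_ok"] = rec["seq"][:4] in barcodes
    return rec

def check_record(rec):
    # Each worker gets a self-contained dict, so the stream of
    # records fans out cleanly across processes.
    return barcode_test(quality_test(rec))

def run_checks(records, n_procs=2):
    with Pool(n_procs) as pool:
        return pool.map(check_record, records)
```

Tallying then happens after the fan-in, over the annotated dicts. Note this sketch sidesteps the shared-quality-dictionary problem by carrying each record's own quality values inside its dict; a large shared reference table would still need the per-process copy or external database discussed above.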
@zachcp excellent and thank you! I tried rewriting my old version a few weeks ago and it ran perfectly on test data, but it kept messing up on real data (I was running the old split_libraries_fastq.py simultaneously to check that I had identical outputs). I had a similar speed-up. One thing I did manage to do was collate all the information from the output log files so that you have a coherent split_libraries_log.txt file. I did this with simple bash tools. I'm still pretty bad at Python, but maybe it would be useful to combine the Python and bash components of our scripts (if anyone bilingual actually has time for this). The current version of my script is here: https://github.com/alk224/akutils/blob/master/split_libraries_parallel.sh.
I am not sure if this applies to all of the split_libraries functionality, but I was able to reproduce the split_libraries primer-trimming feature by incorporating a memoized adaptation of a Levenshtein search. This reduced the time on a ~3 GB merged data set from ~5 hr to ~30 min. https://github.com/billyfournier/remove_primers_levenshtien/blob/master/extract_primer_lev.py Again, this only deals with the primer-removal functionality, but maybe someone will find it useful.
I attempted using pools and threads with no success.
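The core idea — a memoized edit distance plus a fuzzy match against the start of each read — can be sketched like this. This is a minimal illustration of the technique, not billyfournier's actual implementation; the function names and the leading-window assumption are mine:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lev(a, b):
    """Recursive Levenshtein distance; lru_cache memoizes repeated
    subproblems, which pays off across millions of similar reads."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = 0 if a[0] == b[0] else 1
    return min(lev(a[1:], b) + 1,        # deletion
               lev(a, b[1:]) + 1,        # insertion
               lev(a[1:], b[1:]) + cost) # match/substitution
```

Even this naive recursion is fast enough for primer-length strings once the cache warms up, since reads share most of their leading windows.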
For those using usearch/vsearch for quality filtering, VSEARCH now supports --relabel. Combine this with GNU parallel and you can append QIIME-compatible labels to many files at very high speed.
The usefulness of this depends on which quality-filtering paradigm is adopted for QIIME 2. If 'maximum expected error' filtering is chosen over 'sliding window', this would be a great option.
I hacked together a parallel version of split_libraries_fastq in a couple of hours. It runs much, much faster than the native QIIME command. This can sometimes be one of the rate-limiting steps in getting QC data to clients, so making it faster would be useful to more people than just myself.
See my code here: https://github.com/alk224/FirstRepository