ENH: Parallelize sequence download and processing

misialq commented 2 years ago

This PR introduces parallelization into the get-sequences method to speed up sequence fetching.

Summary of changes:

- _run_fasterq_dump_for_all and _process_downloaded_sequences each run in their own separate process
- _write2casava_dir runs in a process pool with a configurable number of workers
- These processes exchange information through three queues: one for IDs that were already downloaded, one for filenames of pre-processed files, and one for filenames of sequences that were successfully processed
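
For readers who want a feel for the queue handoff described above, here is a minimal sketch of the pattern. It is not the code from this PR: the function names (`download_all`, `process_downloaded`, `write_one`, `run_pipeline`), the `SENTINEL` marker and the fake filenames are all hypothetical, and threads with `queue.Queue` stand in for the separate processes and process pool so that the example runs anywhere without a `__main__` guard.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

SENTINEL = None  # hypothetical end-of-stream marker passed through the queues


def download_all(ids, downloaded_q):
    """Stand-in for the download step: announce every ID once it is 'downloaded'."""
    for acc_id in ids:
        # A real implementation would fetch the data here.
        downloaded_q.put(acc_id)
    downloaded_q.put(SENTINEL)


def process_downloaded(downloaded_q, preprocessed_q):
    """Stand-in for the processing step: turn downloaded IDs into filenames."""
    while True:
        acc_id = downloaded_q.get()
        if acc_id is SENTINEL:
            preprocessed_q.put(SENTINEL)  # tell the consumer nothing more is coming
            break
        preprocessed_q.put(f"{acc_id}.fastq")


def write_one(filename, done_q):
    """Stand-in for the writing step: report a file that was 'written'."""
    done_q.put(f"{filename}.gz")


def run_pipeline(ids, n_workers=2):
    """Wire the three stages together through three queues."""
    downloaded_q = queue.Queue()    # IDs that were already downloaded
    preprocessed_q = queue.Queue()  # filenames of pre-processed files
    done_q = queue.Queue()          # filenames that were successfully processed

    downloader = threading.Thread(target=download_all, args=(ids, downloaded_q))
    processor = threading.Thread(
        target=process_downloaded, args=(downloaded_q, preprocessed_q)
    )
    downloader.start()
    processor.start()

    # The pool size plays the role of the configurable number of workers.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while True:
            filename = preprocessed_q.get()
            if filename is SENTINEL:
                break
            pool.submit(write_one, filename, done_q)
    # Leaving the `with` block waits for all submitted write tasks.

    downloader.join()
    processor.join()

    results = []
    while not done_q.empty():
        results.append(done_q.get())
    return sorted(results)
```

Calling `run_pipeline(["A", "B"], n_workers=2)` pushes each ID through all three queues. Because every stage starts working as soon as the previous one emits an item, slow downloads and slow writes can overlap, which is where the time saving comes from.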

The diagram below shows how these new elements play together:

[Image: queue_design]

~One minor issue I found is that the logs produced inside of these processes seem to be swallowed somewhere and they do not show. Needs investigating.~

Testing: These changes can be tested with pretty much any list of IDs. Just compare runtimes between main and this PR using a few different values of _njobs.

Closes #89.

misialq commented 2 years ago

Hey @adamovanja, thanks for your review! Just for comparison, I tried re-downloading some of my genome data and with a couple of workers it now took ~4h instead of 30+h 🙈 . So, you're right, for smaller datasets the difference is not as striking, but the bigger they get, the more time one can save :)

adamovanja commented 2 years ago

thanks for the clarifications 🚀 Nice, I can't wait to try it out on my long list of projectIDs.

bokulich-lab / q2-fondue

ENH: Parallelize sequence download and processing #90