Closed misialq closed 2 years ago
Hey @adamovanja, thanks for your review! Just for comparison, I tried re-downloading some of my genome data, and with a couple of workers it now took ~4h instead of 30+ 🙈. So you're right: for smaller datasets the difference is not as striking, but the bigger they get, the more time one can save :)
Thanks for the clarifications 🚀 Nice, I can't wait to try it out on my long list of projectIDs.
This PR introduces parallelization into the `get-sequences` method to make sequence fetching lightning-fast.

Summary of changes:
- `_run_fasterq_dump_for_all` and `_process_downloaded_sequences` run in their own separate processes
- `_write2casava_dir` runs in a process pool with a configurable number of workers

The diagram below shows how these new elements play together:
~One minor issue I found is that the logs produced inside of these processes seem to be swallowed somewhere and they do not show. Needs investigating.~
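For anyone unfamiliar with the pattern, here is a minimal sketch of the "process pool with a configurable number of workers" idea from the summary. It is not the code from this PR: `write_one`, `write_all` and the `executor_factory` parameter are hypothetical names made up for illustration, and `write_one` only stands in for the real per-sample work.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List


def write_one(sample_id: str) -> str:
    """Hypothetical stand-in for the per-sample work; NOT the real `_write2casava_dir`."""
    return f"{sample_id}.done"


def write_all(
    sample_ids: Iterable[str],
    n_jobs: int = 2,
    executor_factory: Callable = ProcessPoolExecutor,
) -> List[str]:
    """Run `write_one` for every ID using a pool of `n_jobs` workers.

    `executor_factory` is an assumption of this sketch: it lets the same code
    run with threads instead of processes, e.g. for quick local checks.
    """
    with executor_factory(max_workers=n_jobs) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(write_one, sample_ids))


if __name__ == "__main__":
    # The guard matters for process pools on platforms that start workers
    # by re-importing the main module.
    print(write_all(["ID1", "ID2", "ID3"], n_jobs=2))
```

Passing `executor_factory=ThreadPoolExecutor` runs the same code with threads, which can be handy when debugging because everything stays in one process.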
Testing: These changes can be tested with pretty much any list of IDs. Just compare runtimes between main and this PR using a couple of _njobs.
Closes #89.