bokulich-lab / q2-fondue

Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere
BSD 3-Clause "New" or "Revised" License

ENH: Parallelize sequence download and processing #90

Closed misialq closed 2 years ago

misialq commented 2 years ago

This PR introduces parallelization into the get-sequences method to speed up sequence fetching.

Summary of changes:

The diagram below shows how these new elements play together:

[Figure: queue_design]
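For readers without access to the diagram, here is a minimal sketch of a queue-based fetch/process design. It is not the code from this PR: `fetch`, `process`, `run_pipeline` and `n_workers` are hypothetical names, the two steps are placeholders that only rename strings, and threads are used instead of the processes mentioned in this PR to keep the example self-contained.

```python
import queue
import threading

SENTINEL = None  # marks the end of the work stream


def fetch(run_id):
    # Placeholder for the download step (hypothetical, no network access).
    return f"{run_id}.sra"


def process(path):
    # Placeholder for the post-download processing step (hypothetical).
    return path.replace(".sra", ".fastq")


def run_pipeline(run_ids, n_workers=2):
    fetched = queue.Queue()  # hands fetched items over to the workers
    results = []
    lock = threading.Lock()  # guards the shared results list

    def producer():
        for run_id in run_ids:
            fetched.put(fetch(run_id))
        # One sentinel per worker so that every consumer stops.
        for _ in range(n_workers):
            fetched.put(SENTINEL)

    def consumer():
        while True:
            item = fetched.get()
            if item is SENTINEL:
                break
            out = process(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Completion order is not deterministic, so sort before returning.
    return sorted(results)
```

The queue decouples the two stages: processing of one item can start while the next one is still being fetched, and the number of consumers can be tuned independently.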

~One minor issue I found is that the logs produced inside of these processes seem to be swallowed somewhere and they do not show. Needs investigating.~

Testing: These changes can be tested with pretty much any list of IDs. Just compare runtimes between main and this PR using a couple of different n_jobs values.
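A runtime comparison of this kind can be sketched as below. This is only an illustration under assumptions: `fake_fetch` and `timed_fetch` are hypothetical helpers, the download is simulated with a short sleep, and a thread pool stands in for whatever the plugin uses internally. For a real test, time the actual get-sequences call on both branches instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(run_id, delay=0.05):
    # Simulated download (hypothetical): just waits, then echoes the ID.
    time.sleep(delay)
    return run_id


def timed_fetch(run_ids, n_jobs):
    # Fetch all IDs with n_jobs workers and report the wall-clock time.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(pool.map(fake_fetch, run_ids))
    return results, time.perf_counter() - start
```

Calling `timed_fetch(ids, 1)` and `timed_fetch(ids, 4)` on the same list should return identical results, with the second call finishing noticeably faster once the list holds more than a handful of IDs.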

Closes #89.

misialq commented 2 years ago

Hey @adamovanja, thanks for your review! Just for comparison, I tried re-downloading some of my genome data and with a couple of workers it now took ~4 h instead of 30+ h 🙈. So, you're right, for smaller datasets the difference is not as striking, but the bigger they get, the more time one can save :)

adamovanja commented 2 years ago

Thanks for the clarifications 🚀 Nice, I can't wait to try it out on my long list of project IDs.