When fetching tens of runs, one often has to wait a very long time for q2-fondue to process and save the sequences (even with amplicon data, not to mention (meta)genomes). Looking at the code, I realized that there are two main issues with the approach we are taking within the get-sequences method (and here we were, blaming it on QIIME! 🙈):
All the steps are executed sequentially (download -> pre-process (incl. renaming) -> process (write to final files)).
Within the (pre-)processing steps, files are processed one by one.
Two main, relatively easy solutions (at least for now) addressing those points could be:
Pre-processing and writing can be executed as soon as the download of a given ID is finished. Since downloading is not CPU-intensive, we can make use of the idle CPUs to start processing the data.
Independent runs do not need to be processed one by one; that step can easily be parallelized using a pool of workers.
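
To make the two ideas a bit more concrete, here is a rough sketch of how they could fit together. It is not the current q2-fondue code: `download_run`, `process_run` and `fetch_and_process` are hypothetical stand-ins, and the file names are made up for illustration. Each run is handed to a worker pool as soon as its download finishes, and the runs are processed independently of each other.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_run(run_id):
    """Hypothetical stand-in for the download step (I/O bound)."""
    time.sleep(0.01)  # simulate waiting on the network
    return run_id, f"{run_id}.fastq"


def process_run(run_id, path):
    """Hypothetical stand-in for pre-processing (renaming) and writing."""
    return run_id, path.replace(".fastq", "_processed.fastq")


def fetch_and_process(run_ids, n_downloads=4, n_workers=2,
                      pool_cls=ThreadPoolExecutor):
    """Start processing each run as soon as its download has finished.

    pool_cls defaults to a thread pool to keep the sketch simple. For
    CPU-bound processing, concurrent.futures.ProcessPoolExecutor could be
    passed instead (assumption: process_run and its arguments are picklable).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=n_downloads) as downloader:
        with pool_cls(max_workers=n_workers) as workers:
            downloads = [downloader.submit(download_run, r) for r in run_ids]
            processing = []
            # Hand each finished download over to the worker pool right
            # away, instead of waiting for all downloads to complete.
            for done in as_completed(downloads):
                run_id, path = done.result()
                processing.append(workers.submit(process_run, run_id, path))
            # Collect the processed runs in whatever order they finish.
            for done in as_completed(processing):
                run_id, out_path = done.result()
                results[run_id] = out_path
    return results
```

Calling `fetch_and_process(["SRR1", "SRR2", "SRR3"])` returns a dict mapping each run ID to its (made-up) output path. Whether threads or processes are the better fit for the real (pre-)processing steps would need to be checked against what those steps actually do.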