clemente-lab / mmeds-meta

A database for storing and analyzing omics data
https://mmeds.org

Parallelize analysis sections #400

Closed adamcantor22 closed 4 months ago

adamcantor22 commented 2 years ago

Is your feature request related to a problem? Please describe.
The current solution for multiple demux/denoise runs per analysis is to run them in series. This is quite inefficient for larger studies, and jobs may need to be submitted to -q long in order to run successfully.

Describe the solution you'd like
These steps should be able to run in parallel, probably using a qiime1-esque solution in which the expected output files are polled for. Once all the output files exist, the main job can be started to merge all the sub-components. This solution should be general enough that it could potentially be reused for other parallelization (e.g. a new ANCOM implementation #386).
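The poll-and-merge pattern described above could be sketched roughly as follows. This is a minimal, hedged sketch: `subprocess.Popen` stands in for whatever scheduler submission the cluster actually uses (e.g. bsub), and the function name and command lists are hypothetical, not part of mmeds.

```python
import subprocess
import time
from pathlib import Path

def run_with_workers(worker_cmds, expected_outputs, merge_cmd, poll=2.0):
    """Launch worker jobs, wait for their declared outputs, then merge.

    worker_cmds: list of argv lists, one per worker sub-job.
    expected_outputs: file paths each worker is expected to create.
    merge_cmd: argv list for the main merge step.
    """
    # Launch every worker as an independent process
    # (a stand-in for submitting separate scheduler jobs).
    procs = [subprocess.Popen(cmd) for cmd in worker_cmds]

    # The main job polls until every expected output file exists.
    pending = {Path(p) for p in expected_outputs}
    while pending:
        pending = {p for p in pending if not p.exists()}
        if pending:
            time.sleep(poll)

    # Reap the worker processes, then run the merge job.
    for p in procs:
        p.wait()
    subprocess.run(merge_cmd, check=True)
```

As written this has the weakness noted below in the thread: if a worker dies without creating its output, the poll loop waits forever.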

Describe alternatives you've considered
We briefly discussed multi-threading, but quickly dismissed it because it would significantly increase the complexity of our code.

cleme commented 2 years ago

Q1 used to have a solution along these lines: a main job is submitted that spawns worker sub-jobs, which do the computation, while the main job waits until all output files have been created. Details here:

https://github.com/biocore/qiime/tree/master/qiime/parallel

poller.py and util.py have most of the functionality that we would require. This solution is not ideal, because when a worker job does not complete, the main job has no way to "know" the files will never be created and it keeps waiting until it hits walltime. It might be worth reviewing how Q2 implements parallelization.
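One way to avoid the main job idling until walltime is to give the poller an explicit deadline, so it fails fast and reports which outputs are missing. A minimal sketch, assuming the caller knows a reasonable per-job timeout (the function name and default values here are placeholders, not the Q1 API):

```python
import time
from pathlib import Path

def wait_for_outputs(paths, timeout=3600.0, poll_interval=5.0):
    """Poll until every path in `paths` exists.

    Raises TimeoutError (listing the missing files) once `timeout`
    seconds elapse, so the main job fails fast instead of sitting
    until the scheduler kills it at walltime.
    """
    deadline = time.monotonic() + timeout
    pending = {Path(p) for p in paths}
    while pending:
        pending = {p for p in pending if not p.exists()}
        if not pending:
            return
        if time.monotonic() > deadline:
            missing = sorted(str(p) for p in pending)
            raise TimeoutError(f"outputs still missing: {missing}")
        time.sleep(poll_interval)
```

A stricter variant could also watch the workers' scheduler status and abort as soon as any worker job exits without producing its output, rather than waiting for the deadline.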

adamcantor22 commented 12 months ago

While "full" parallelization is a challenging issue, there are a number of simple changes we could make to parallelize sections. This includes parallelizing differential abundance testing, taxa summarizing, and most importantly, demux/denoising. When there are many sequencing runs in a study, this step is much more serialized than it needs to be. Each run imports the fastqs to qiime artifact, demuxes, and denoises sequentially, then moves to the next run. These individual steps can be safely run in parallel across all runs. I.e., all fastq imports run in parallel, then all demuxes run in parallel, then all denoises. This will significantly speed these runs up. It may be challenging to do this when working with runs of different types (e.g. single vs dual barcodes) but at least, this can be applied to runs of the same type.

adamcantor22 commented 4 months ago

Superseded by snakemake, which has this functionality #457