clemente-lab / mmeds-meta

A database for storing and analyzing omics data
https://mmeds.org

Re-evaluate structure of MMEDS analysis pipeline (specifically ANCOM) #386

Closed adamcantor22 closed 4 months ago

adamcantor22 commented 2 years ago

Is your feature request related to a problem? Please describe.

This can be an opportunity to discuss more than just this one thing, but this is specifically about the role of the ANCOM significance tests (qiime composition ancom). Currently, an ANCOM test is run for every variable and taxa level (plus one that is not taxa level-specific) listed in the .yaml file. On an average study, this is probably about 2 taxa levels and 3-6 variables, for a total of 12-18 tests.

For studies with no more than a few dozen samples this is manageable, at about 10-15 minutes per test. However, for larger studies of several hundred samples it quickly becomes unsustainable. One study containing 572 samples took 14 hours to complete a single significance test and timed out before it could complete a second. A larger study, containing 1,566 samples, did not complete even one test, timing out after 84 hours. Additionally, these tests rarely see use: we far more commonly use the significance testing of alpha and beta diversity metrics, which takes next to no time at all to run.
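Roughly what that fan-out looks like (a sketch only; the variable names, file names, and levels here are made up, and the real values come from the study's .yaml config rather than this loop):

```python
import subprocess

# Illustration of the ANCOM fan-out: one `qiime composition ancom` call per
# (metadata variable, taxa level) pair, plus one per variable that is not
# level-specific. All names below are placeholders.
variables = ["Treatment", "BodySite", "Timepoint"]
taxa_levels = [5, 6]  # e.g. family, genus

def run_ancom(comp_table, column, out_qzv):
    # Each call is an independent, long-running significance test.
    subprocess.run([
        "qiime", "composition", "ancom",
        "--i-table", comp_table,           # FeatureTable[Composition] artifact
        "--m-metadata-file", "metadata.tsv",
        "--m-metadata-column", column,
        "--o-visualization", out_qzv,
    ], check=True)

for column in variables:
    run_ancom("comp_table.qza", column, f"ancom_{column}.qzv")  # level-agnostic test
    for level in taxa_levels:
        run_ancom(f"comp_table_L{level}.qza", column, f"ancom_{column}_L{level}.qzv")
# With v variables and t taxa levels, this sketch runs v * (t + 1) tests, each of
# which takes minutes on a small study and many hours once samples exceed ~500.
```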

Describe the solution you'd like

I think these tests should be made optional, either through the .yaml configuration files or through the interactive system that replaces them. They are too unwieldy and too rarely used to belong in our "standard" pipeline.
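A minimal sketch of what an opt-in gate could look like; the `run_ancom` key and file name are invented for illustration, not existing MMEDS options:

```python
import yaml  # PyYAML

def run_all_ancom_tests(cfg):
    # Stand-in for the expensive per-variable, per-level ANCOM loop.
    print("Would run ANCOM for:", cfg.get("metadata_columns", []))

with open("analysis_config.yaml") as fh:
    config = yaml.safe_load(fh) or {}

# Default to skipping, so only studies that explicitly request ANCOM pay the cost.
if config.get("run_ancom", False):
    run_all_ancom_tests(config)
else:
    print("Skipping ANCOM significance tests ('run_ancom' not enabled)")
```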

Additional context

It should be noted that because these tests often consume the remainder of an analysis job's allotted time, the job never gets a chance to reach the summary step on its own and has to be restarted manually. This is far from ideal from a user perspective.

adamcantor22 commented 2 years ago

Something else to consider: we currently use pheniqs demultiplexing only for studies with dual barcodes. Studies with paired-end reads and single barcodes (the other most common format) still use qiime demux emp-paired for demultiplexing. Are we interested in switching over to pheniqs for this type as well? Perhaps we could run a test comparing the ASV outputs of both methods on a single study. Our pheniqs method does have at least one advantage: the ability to specify the number of allowed barcode errors.
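If we do run that comparison, one quick way to check agreement would be to export each route's feature table to BIOM and look at ASV overlap (a sketch only; the file names and the export step are assumptions, not part of the current pipeline):

```python
import biom

# Hypothetical comparison of ASV tables from the two demultiplexing routes
# (pheniqs vs. `qiime demux emp-paired`), after exporting each QIIME 2
# FeatureTable[Frequency] artifact to BIOM format. File names are placeholders.
pheniqs_table = biom.load_table("table_pheniqs.biom")
emp_table = biom.load_table("table_emp_paired.biom")

pheniqs_asvs = set(pheniqs_table.ids(axis="observation"))
emp_asvs = set(emp_table.ids(axis="observation"))
shared = pheniqs_asvs & emp_asvs
union = pheniqs_asvs | emp_asvs

print(f"pheniqs ASVs:    {len(pheniqs_asvs)}")
print(f"emp-paired ASVs: {len(emp_asvs)}")
print(f"shared:          {len(shared)} ({100 * len(shared) / max(len(union), 1):.1f}% of union)")
```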

cleme commented 2 years ago

This should be split into different issues: one for ANCOM (this one), one for demultiplexing (new issue).

On the topic of ANCOM: is this parallelizable? Or is it simply that a single ANCOM test takes a very long time when sample size is >500? This post suggests ANCOM simply takes a very long time:

https://forum.qiime2.org/t/ancom-running-for-30-hours/3991

If it is really a matter of efficiency, then this is not something we want to tackle now, but we should look into QIIME's ANCOM implementation and think about whether there is anything we can do to speed it up.
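One note on the parallelization question: even if a single qiime composition ancom run cannot be sped up internally, the per-variable/per-level tests are independent of each other, so they could at least run concurrently. A sketch under that assumption (paths, job list, and worker count are made up; on the cluster these would more likely be separate scheduler jobs):

```python
from concurrent.futures import ProcessPoolExecutor
import subprocess

def run_ancom(args):
    comp_table, column, out_qzv = args
    # Same `qiime composition ancom` invocation as in the sketch above.
    subprocess.run(["qiime", "composition", "ancom",
                    "--i-table", comp_table,
                    "--m-metadata-file", "metadata.tsv",
                    "--m-metadata-column", column,
                    "--o-visualization", out_qzv], check=True)
    return out_qzv

# One entry per independent (composition table, metadata column) test;
# these names are placeholders.
jobs = [("comp_table_L6.qza", "Treatment", "ancom_Treatment_L6.qzv"),
        ("comp_table_L6.qza", "BodySite", "ancom_BodySite_L6.qzv")]

if __name__ == "__main__":
    # A local process pool just for illustration of running the tests side by side.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for out in pool.map(run_ancom, jobs):
            print("finished", out)
```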

adamcantor22 commented 2 years ago

Per discussion, we are not going to pursue this demultiplexing issue here; we need to keep our current non-dual-barcode solutions for now.

adamcantor22 commented 4 months ago

Superseded by the overhaul of the analysis system to Snakemake workflows; see #457.