a-h-b / dadasnake

Amplicon sequencing workflow heavily using DADA2 and implemented in snakemake
GNU General Public License v3.0

Within-run pooling #7

Closed by vmikk 3 years ago

vmikk commented 3 years ago

Hello Anna!

This feature request is somewhat related to #6.

Currently, there are three DADA2 modes in Dadasnake: per-sample, pooled, and pseudo-pooled inference. Unfortunately, 120 GB of RAM is not enough to perform pooled inference on our data, so we are now using sample-wise removal of sequencing errors (dada_dadaReads.single.R, to be exact). However, it would also be possible to perform within-run pooling.

For this purpose, dada_dadaReads.pool.R can be run with the errors/models.{run}.RDS file generated for each run and the FASTQs from that same run as input.

To my surprise, within-run pooling was much faster (but of course more RAM-demanding) than sample-wise inference (due to the issue mentioned in #6). This mode would avoid spawning multiple tasks to create merged/{run}/{sample}.RDS and would directly produce merged/dada_merged.{run}.RDS. And, in theory, it should have more power to resolve ASVs than sample-wise inference.
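The difference in the number of outputs can be sketched in a few lines of Python (the language snakemake builds on). This is a minimal illustration, not dadasnake's code: the sample sheet, the function `merged_outputs`, and the mode labels are hypothetical; only the path patterns are taken from this comment.

```python
# Hypothetical sample sheet: sample name -> sequencing run (not real data).
SAMPLE_RUNS = {"s1": "runA", "s2": "runA", "s3": "runA", "s4": "runB"}


def merged_outputs(sample_runs, mode):
    """Return the merged RDS files each mode would be expected to produce.

    The mode labels are illustrative names, not dadasnake options.
    """
    if mode == "sample_wise":
        # One file (and therefore one task) per sample.
        return sorted(
            f"merged/{run}/{sample}.RDS" for sample, run in sample_runs.items()
        )
    if mode == "within_run":
        # One file per run, covering all samples of that run.
        return sorted(
            {f"merged/dada_merged.{run}.RDS" for run in sample_runs.values()}
        )
    raise ValueError(f"unknown mode: {mode}")


for mode in ("sample_wise", "within_run"):
    print(mode, merged_outputs(SAMPLE_RUNS, mode))
```

With four samples spread over two runs, the sample-wise mode yields four files and the within-run mode yields two.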

With kind regards, Vladimir

a-h-b commented 3 years ago

Hi Vladimir - The reason I originally did not want to include this kind of workflow has to do with what I want to influence the ASVs found in each sample. Currently dadasnake offers two options: 1) nothing except the sample influences the sample (not pooled), 2) the same samples influence all samples in a study (pooled). Within-run pooling kind of breaks this logic, and in the worst case ASV detection is strongly influenced by which run a sample was on, especially if the runs differ in size. But I do see that it has its advantages. I'll set the workflow up and document the caveat for future users. I'll include it in the next release. Best wishes - Anna
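The logic behind this caveat can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the sample-to-run mapping, the function `influence_groups`, and the mode labels are hypothetical and not part of dadasnake. For each sample, it lists which samples would contribute to its ASV inference under each pooling scheme.

```python
# Hypothetical mapping of samples to runs: one large run, one tiny run.
RUN_OF = {"s1": "runA", "s2": "runA", "s3": "runA", "s4": "runB"}


def influence_groups(run_of, mode):
    """Map each sample to the samples that inform its ASV inference.

    Mode labels are illustrative names chosen for this sketch.
    """
    samples = sorted(run_of)
    if mode == "not_pooled":
        # Nothing except the sample influences the sample.
        return {s: [s] for s in samples}
    if mode == "pooled":
        # The same set of samples influences every sample in the study.
        return {s: list(samples) for s in samples}
    if mode == "within_run":
        # Only samples sequenced on the same run influence each other.
        return {
            s: [t for t in samples if run_of[t] == run_of[s]] for s in samples
        }
    raise ValueError(f"unknown mode: {mode}")


for mode in ("not_pooled", "pooled", "within_run"):
    print(mode, influence_groups(RUN_OF, mode))
```

With one run of three samples and another holding a single sample, within-run pooling lets s1, s2, and s3 support each other, while s4 is effectively inferred alone. That is the dependence on run membership and run size described above.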

vmikk commented 3 years ago

Hello Anna! I see your point as well. And the goal is, of course, to make ASV inference as robust and deterministic as possible.

Thank you for all your hard work on Dadasnake, this software is very helpful! With kind regards, Vladimir

a-h-b commented 3 years ago

Hi Vladimir - so, the option is now in v0.7.6. You can use dada: pool: within_run

Have a lovely weekend - Anna

vmikk commented 3 years ago

Hello Anna! Wow, that's amazing! Thank you so much!

With kind regards, Vladimir