a-h-b / dadasnake

Amplicon sequencing workflow heavily using DADA2 and implemented in snakemake
GNU General Public License v3.0

Feature request / general question re: parameter sweeps for DADA2 #20

Open jcmcnch opened 2 years ago

jcmcnch commented 2 years ago

Hi Anna and co-authors,

Thanks for the wonderful work on this pipeline - it's really a great resource for the whole community.

Q/request: I am interested in using dadasnake to run parameter sweeps that test the effect of denoising parameters on the quality of DADA2's ASVs.

Background: Based on results from sequencing mock communities, we have noticed that DADA2 can create spurious ASVs. In our experience this is rare and only affects specific sequencing runs, but it is potentially problematic for us: these spurious ASVs can comprise up to 7-8% of the mock community reads and therefore presumably cause similar problems in the environmental samples. Most of these artifacts are 1-mismatches to the true mock sequence, so we believe they are artifacts of DADA2's processing rather than contamination or bleedthrough. We therefore want to try a variety of different parameters recommended by Ben Callahan et al. in a combinatorial manner to see if we can eliminate these artifacts. I have experience doing similar parameter sweeps with snakemake, and your pipeline looks like an excellent place for me to at least begin this kind of analysis. Parameters of interest would be those contained in config/config.default.yaml:

dada:
  band_size: 16
  homopolymer_gap_penalty: NULL
  pool: false
  omega_A: 1e-40
  priors: ""
  omega_P: 1e-4
  omega_C: 1e-40
  gapless: true
  selfConsist: false
  no_error_assumptions: false
  kdist_cutoff: 0.42
  match: 4
  mismatch: -5
  gap_penalty: -8
  errorEstimationFunction: loessErrfun
  use_quals: true

For the above parameters, I would try to figure out what a good range of values would be and then run DADA2 for each combination of relevant parameters to try and empirically "tune" DADA2 to see if I can make the artifactual ASVs disappear.
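A minimal sketch of what "each combination of relevant parameters" could look like in practice, using only the Python standard library. The parameter names come from the config excerpt above; the candidate values and the helper name `combinations` are hypothetical placeholders, not recommended settings.

```python
from itertools import product

# Hypothetical sweep ranges: the keys are taken from the `dada:` block above,
# but the candidate values are placeholders, not recommendations.
sweep = {
    "omega_A": [1e-40, 1e-60],
    "band_size": [16, 32],
    "pool": [False, True],
}


def combinations(grid):
    """Yield one dict per combination of the values listed in `grid`."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


combos = list(combinations(sweep))
print(len(combos))  # 2 * 2 * 2 = 8 combinations
```

The number of runs grows multiplicatively with every parameter added, so it may be worth restricting the sweep to the few parameters most likely to affect the 1-mismatch artifacts.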

My understanding of dadasnake's mode of operation: Each type of operation (e.g. dada2-paired) reads its input from a YAML config, which is then passed to the R scripts via snakemake. As such, it does not seem completely straightforward to run parameter sweeps using snakemake's built-in ability to expand parameters.

Potential solutions: Without altering your pipeline, one solution would be to define a large number of config files corresponding to the desired parameter sweeps. This seems a bit unwieldy, though, so I was thinking the best way would be to use your config files and R scripts to run DADA2, but write my own Snakefile defining the parameter sweeps I want. In that case, I will probably fork your repo and try to define an additional rule for this kind of scenario.
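The first option (many config files) does not have to be done by hand. The following is a rough sketch, assuming that one small YAML file per combination is an acceptable input; it writes only a `dada:` block, and whether dadasnake accepts such a partial config or needs the remaining settings in each file is an assumption that would have to be checked. The function names, file naming scheme, and sweep values are all hypothetical.

```python
import itertools
import tempfile
from pathlib import Path


def yaml_scalar(value):
    """Format a Python value the way it appears in the config excerpt."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def write_sweep_configs(grid, out_dir):
    """Write one YAML file per parameter combination and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    keys = sorted(grid)
    paths = []
    for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
        lines = ["dada:"]
        for key, value in zip(keys, values):
            lines.append(f"  {key}: {yaml_scalar(value)}")
        # Hypothetical naming scheme: sweep_000.yaml, sweep_001.yaml, ...
        path = out_dir / f"sweep_{i:03d}.yaml"
        path.write_text("\n".join(lines) + "\n")
        paths.append(path)
    return paths


# Placeholder values for illustration only.
grid = {"band_size": [16, 32], "pool": [False, True]}
paths = write_sweep_configs(grid, tempfile.mkdtemp())
print(paths[0].read_text())
```

Each generated file could then be handed to a separate pipeline run, which keeps the pipeline itself unmodified at the cost of one output directory per combination.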

Do you have any advice on this? Am I missing an easy way to implement my desired behaviour with the pipeline as currently written?

Thanks a lot for your advice and again for making these really great scripts available for the benefit of the whole community.

Cheers, Jesse

a-h-b commented 2 years ago

Hi Jesse - sorry for not getting back to you. I had no experience to share. Did you figure out a way? -A