Closed: btupper closed this issue 2 years ago
I like it! So would filter and trim run in preprocess, and then we point to the outputs from that step in process? Assuming all looks good in user supervision?
Cutadapt should be the first step as all downstream steps assume primers have been trimmed off.
Preprocess through learn_errors(). Then process starting with filter_and_trim(), but with the option to skip over that part and go straight to run_dada().
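A minimal sketch of that split, assuming the wrappers named above (`filter_and_trim()`, `learn_errors()`, `run_dada()`) plus a hypothetical `run_cutadapt()` wrapper and a hypothetical `skip_filter` config flag; the names and signatures here are illustrative, not the pipeline's actual API:

```r
# requires the yaml package for reading the config files

preprocess <- function(cfg_file) {
  cfg <- yaml::read_yaml(cfg_file)
  run_cutadapt(cfg)       # hypothetical wrapper: primers come off before anything downstream
  filter_and_trim(cfg)    # first pass with the config's default parameters
  learn_errors(cfg)       # last preprocess step; outputs go to user review
  invisible(cfg)
}

process <- function(cfg_file) {
  cfg <- yaml::read_yaml(cfg_file)
  if (!isTRUE(cfg$skip_filter)) {
    filter_and_trim(cfg)  # re-run only if the user changed trim parameters
  }
  run_dada(cfg)           # the costly step, now run against a vetted config
}
```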
multistep ASV workflow
It seems that eDNA datasets are, at least for now, mostly edge cases - that is, each new sample submitted to the workflow brings unlooked-for qualities. The pipeline, in its original conception, was designed to be a simple drop-and-run process. That design makes it difficult to ascertain the needs of a particular dataset before running the costly dada and taxonomy-matching steps.
To accommodate the fluidity of eDNA datasets, we propose splitting the ASV workflow into at least three steps: preprocessing, user supervision, and processing.
1 Preprocess
- User generates config `input.yaml` (a hypothetical example follows this list)
- Preprocess reads `outdir/input.yaml` and saves the config
  - as archive `outdir/preprocess/input-preprocessed.yaml`
  - as `outdir/input-supervised.yaml`, the working copy for the next step
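For illustration only, the config might look something like the following; every key below is an assumed schema, not the pipeline's actual one (the `truncLen`/`maxEE` names echo dada2's filterAndTrim arguments):

```yaml
# hypothetical input.yaml; all keys illustrative
input_dir: /path/to/fastq               # raw paired-end reads
output_dir: outdir
cutadapt:
  forward_primer: GTGYCAGCMGCCGCGGTAA   # example 515F primer
  reverse_primer: GGACTACNVGGGTWTCTAAT  # example 806R primer
filter_and_trim:
  truncLen: [240, 160]
  maxEE: [2, 2]
skip_filter: false                      # hypothetical flag from the comment above
```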
2 User supervision
- User reviews the preprocess outputs: "Should I stay or should I go?"
- If a go, the user adjusts the config `outdir/input-supervised.yaml` (example edits below)
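In practice the supervision step might reduce to a few edits in `outdir/input-supervised.yaml`, for example (same hypothetical keys as above):

```yaml
filter_and_trim:
  truncLen: [220, 150]   # tightened after reviewing the quality profiles
skip_filter: false       # trim settings changed, so re-run filter_and_trim()
```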
3 Process
- Runs from the supervised config `outdir/input-supervised.yaml` (usage sketch below)
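Step 3 would then reduce to pointing the processor at the supervised config, e.g. with the sketch above:

```r
process("outdir/input-supervised.yaml")  # re-filters if needed, then run_dada()
```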