McMinds-Lab / analysis_templates

Basic data analysis scripts to be modified for each project

create sensitive dada2 workflow #28

Open rmcminds opened 1 year ago

rmcminds commented 1 year ago

dada2 doesn't recommend merging reads before denoising, because the quality scores reported by the sequencer have a different relationship to the actual error rate than the quality scores generated by merging algorithms. I can see this quite clearly in a dataset where I ran learnErrors separately on reads that were merged with vsearch and on reads that could not be merged and were instead concatenated. (The concatenated category is also biased toward more errors, since it can contain reads that should have merged but couldn't; that bias is itself worth exploring.)
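A minimal sketch of that comparison, assuming dada2 is installed and the filtered reads for each subset sit in separate folders (the paths are placeholders): fit an error model to each subset and inspect the fits with plotErrors.

```r
## Sketch: compare error models fit to vsearch-merged vs. concatenated reads.
## File paths are placeholders; assumes filtered fastq.gz files in each folder.
library(dada2)

merged_fqs <- list.files("filtered/merged", pattern = "\\.fastq\\.gz$", full.names = TRUE)
concat_fqs <- list.files("filtered/concat", pattern = "\\.fastq\\.gz$", full.names = TRUE)

err_merged <- learnErrors(merged_fqs, multithread = TRUE)
err_concat <- learnErrors(concat_fqs, multithread = TRUE)

## plotErrors overlays the observed error frequencies and the loess fit, and
## with nominalQ = TRUE adds the error rate implied by the reported quality
## scores; divergence between the two subsets is the Q-score mismatch above.
plotErrors(err_merged, nominalQ = TRUE)
plotErrors(err_concat, nominalQ = TRUE)
```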

Perhaps we could use dada2's learnErrors function on different subsets of the data just to correct the Q scores in each fastq, then pool the subsets for a single denoising run. I considered denoising the subsets separately, but there may be cases where reads from different subsets should be pooled into a single ASV. The correction itself could be pretty simple: use the fitted loess error model to map each reported quality score to a corrected one, and re-write the fastq.
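One possible shape for that correction, as an untested R sketch: derive a per-Q error probability from the learnErrors fit, build an old-Q to new-Q lookup, and remap the quality strings. The file paths, the row-name convention of the transition matrix, and the averaging over transition types are assumptions for illustration, not dada2-documented behavior.

```r
## Untested sketch of the proposed Q-score correction for one subset.
## Assumes dada2, ShortRead, and Biostrings; file paths are placeholders.
library(dada2)
library(ShortRead)

infile  <- "filtered/merged_subset.fastq.gz"
outfile <- "corrected/merged_subset.fastq.gz"

err <- learnErrors(infile, multithread = TRUE)

## getErrors() returns a 16 x (maxQ+1) matrix of transition probabilities
## (rows A2A, A2C, ..., T2T; columns indexed by reported quality 0..maxQ)
errmat <- getErrors(err)

## Collapse the 12 mismatch rows into one overall error probability per Q
## (averaging over the 4 'from' bases; an assumed way to summarize the model)
mismatch <- rownames(errmat)[substr(rownames(errmat), 1, 1) !=
                             substr(rownames(errmat), 3, 3)]
p_err <- colSums(errmat[mismatch, ]) / 4
q_new <- pmin(41, pmax(0, round(-10 * log10(pmax(p_err, 1e-5)))))

## Build a character-level lookup (Phred+33) and remap the quality strings
old_chars <- rawToChar(as.raw(33 + seq_along(q_new) - 1))
new_chars <- rawToChar(as.raw(33 + q_new))
fq <- readFastq(infile)
qq <- chartr(old_chars, new_chars, quality(quality(fq)))
writeFastq(ShortReadQ(sread(fq), FastqQuality(qq), id(fq)),
           outfile, compress = TRUE)
```

After rewriting each subset this way, a single dada() call on all the corrected fastqs together would give the pooled denoising described above.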

This workflow could be useful in edge cases like fungal ITS, where there is substantial length variation. We don't want to use only the forward reads, because we'd lose the information in the reverse reads; we don't want to run dada2 separately on forward and reverse reads as currently recommended, because in my experience that has produced artificial chimeras; and we don't want to merge reads and discard everything that fails to merge, because many of the unmerged pairs may simply be too long to have significant overlaps.