benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Advice on choosing maxEE and omission/inclusion of bad quality samples #1602

Closed: elpape closed this 4 months ago

elpape commented 2 years ago

Dear dada2 enthusiasts,

Upon analyzing my HTS (Illumina MiSeq 2x300 bp) data, generated after amplification of the CO1 I3-M11 fragment (ca. 396 bp without primers), I am faced with doubts about the most appropriate parameters for filterAndTrim() and about whether to include or omit samples with poorer read quality.

See attachments for read quality profiles of forward and reverse reads per sample (R1: CO1_R1_quality_plots.pdf, R2: CO1_R2_quality_plots.pdf) and also aggregated plots for the forward (CO1_R1_agg_quality_plot.pdf) and reverse reads (CO1_R2_agg_quality_plot.pdf). As you will see, samples MUC005 and MUC021 in particular are not great in terms of read quality (their R1 reads are also quite bad, already dropping in quality around 100 bp). The blue horizontal line in my plots denotes the quality score threshold of 30.
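
Plots like these can be generated with plotQualityProfile(); a minimal sketch, assuming fnFs and fnRs hold the paths to the raw forward and reverse fastq files:

library(dada2)
# Per-sample quality profiles (one panel per fastq file)
plotQualityProfile(fnFs)
plotQualityProfile(fnRs)
# Aggregated profiles across all samples
plotQualityProfile(fnFs, aggregate = TRUE)
plotQualityProfile(fnRs, aggregate = TRUE)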

I have decided to truncate R1 and R2 at 240 bp and 180 bp, respectively, based on the aggregated read quality plots and the output of figaro, and also to make sure sufficient overlap is maintained for merging R1 and R2.
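
As a quick sanity check on that overlap (a sketch assuming the ca. 396 bp amplicon length stated above and the 12 bp default minOverlap of mergePairs()):

amplicon_len <- 396                    # approx. CO1 I3-M11 length without primers
overlap <- 240 + 180 - amplicon_len    # ~24 bp, above the 12 bp default minOverlap
overlap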

I tried:

out_standard <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen = c(240, 180),
                              maxN = 0, maxEE = c(2, 2), truncQ = 2, rm.phix = TRUE,
                              compress = TRUE, multithread = FALSE)

which resulted in only 36% of the reads for sample MUC021 being retained, which I thought was quite low (more reads were retained in the other samples, with a median of 79% of reads passing).

I then decided to relax maxEE, as suggested in the dada2 pipeline tutorial, to maxEE = c(2, 5), which resulted in 38% of the reads in MUC021 being retained.

I have also attached the plots of the error models for the R2 reads (maxEE=2: Errors_R_maxEE_2.pdf, maxEE=5: Errors_R_maxEE_5.pdf). There are some differences in how far the observed error rates deviate from the estimated error rates for some transitions (encircled in blue), but to me it seems they do not differ that much. Or am I missing something here?

My questions are as follows:

  1. Should I go for maxEE=c(2,5), since more reads pass the filterAndTrim step and the error-model plots are (I think) not much worse than the ones for maxEE=2? I did hear that a maxEE as high as 5 may be frowned upon or not accepted by reviewers?
  2. Should I perhaps consider leaving out MUC021 (and MUC005) at the start of the pipeline, or does it not really matter for the error model, sample inference, etc.?

I did read in issue #556 (https://github.com/benjjneb/dada2/issues/556) the response of Benjamin Callahan: "The error-correction algorithm is robust to lower quality, up to a point. But when you have too much low quality sequence that you can't trim off, the ability to distinguish true sequences from errors starts to degrade (technically this depends on fraction of error-free reads in the data)." So I am not sure if my samples are bad enough to warrant omission at the start of the dada2 pipeline.

Any advice to this newbie in HTS data processing would be greatly appreciated!

Thanks, Ellen

benjjneb commented 2 years ago

I would stick with maxEE=2. In general, you don't want to tailor your filtering parameters to the outlier bad-quality samples. Those often occur because of an issue in library preparation that is specific to those samples (e.g. very low input DNA), and that just isn't going to be fixed or made better by making filtering lenient enough to let those probably-bad reads pass. There is also the potential to harm (a little bit) all the rest of the good libraries by not doing appropriate filtering for them.

As for when to exclude outlier poor quality samples, I usually do it after denoising, but if they are clear outliers based on pre-filtering quality metrics, it is totally valid to exclude them on that basis at that point.
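
A minimal sketch of that post-denoising exclusion, assuming seqtab is the sample-by-ASV table from makeSequenceTable() with sample names as its rownames:

# Drop outlier samples after denoising/merging, then remove ASVs left with zero counts
bad_samples <- c("MUC021", "MUC005")
seqtab_keep <- seqtab[!rownames(seqtab) %in% bad_samples, ]
seqtab_keep <- seqtab_keep[, colSums(seqtab_keep) > 0]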

elpape commented 2 years ago

Hi!

Thank you so much for your response! Concerning excluding samples before or after denoising, I'll also build the error-model plots with these samples excluded and check whether they are a "better fit".

best wishes, Ellen

nickp60 commented 2 years ago

Hi @benjjneb and @elpape, we have had similar questions about setting sample exclusion criteria. You said that you usually exclude poor-quality samples after denoising. I am worried that the poor-quality samples would "poison" the error modeling if the poor-quality cycles don't actually reflect a systematic bias in that sequencing run. Is that a valid worry?

Is there a proportion of reads lost to filtering and trimming that you would consider valid grounds for excluding the sample?

nickp60 commented 2 years ago

This old thread mentioned that you wanted well over 10% (https://github.com/benjjneb/dada2/issues/232#issuecomment-296188769), but that was a while ago, so I'm curious if your intuition has changed or if you've settled on particular cutoffs.

benjjneb commented 1 year ago

Intuition is the same. Losing >90% of reads at the quality-filtering step indicates a closer look is needed.
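
For example, the fraction retained can be computed directly from the matrix returned by filterAndTrim() (here assumed to be the out_standard object from the call above):

retained <- out_standard[, "reads.out"] / out_standard[, "reads.in"]
# Samples losing >90% of reads at filtering, i.e. <10% retained
flagged <- rownames(out_standard)[retained < 0.10]
data.frame(fraction_retained = round(retained, 3))
flagged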

nickp60 commented 1 year ago

Thanks so much @benjjneb; we are trying to find a lower bound for that. I get that it's a study-specific question without easy answers, but I'm trying to come up with criteria for our group to flag samples for re-sequencing/re-extraction. I agree that over 90% loss is bad, but what about 60% loss? Or 80%? I guess the only way to answer it truly is to re-extract and re-sequence samples and determine how much the community differs. Would you think excluding samples with 60% loss is too conservative?

And back to the error-learning step, am I worried unnecessarily about including bad samples? There are two cases to consider: one where all the samples in a batch are mediocre and some fall below a retention threshold, and the other where most of the samples look good except for a few bad apples. My guess is that including "bad" samples in the first case wouldn't be too harmful, as the errors/poor quality are systemic to that batch, but in the second case my gut would be to exclude bad samples, since the bad quality is sample-specific and probably not informative for the rest of the batch.

Thanks so much!

elpape commented 1 year ago

Hi @nickp60 :-) I had the same worry as formulated in your second case, where most samples are of good quality and a few samples are of bad quality. What I did was simply compare the plots output by the error-learning step for (1) all samples and (2) all samples minus the "bad" samples. I did not see much of a difference in the trends of the observed and estimated error rates, so I concluded that it was okay to include these "bad" samples, and I hope this was the correct way of handling this...
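
For reference, a minimal sketch of that comparison, assuming filtRs holds the filtered R2 files and the "bad" samples can be identified by name (MUC021/MUC005, as above):

bad <- grepl("MUC021|MUC005", filtRs)
errR_all <- learnErrors(filtRs, multithread = TRUE)
errR_sub <- learnErrors(filtRs[!bad], multithread = TRUE)
# Compare observed (points) vs. estimated (lines) error rates between the two fits
plotErrors(errR_all, nominalQ = TRUE)
plotErrors(errR_sub, nominalQ = TRUE)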

benjjneb commented 1 year ago

Would you think excluding samples with 60% loss is too conservative?

I agree that it is a study-specific question without a one-size-fits-all threshold. That said, "excluding samples with 60% loss" does not strike me as unreasonable.

I guess the only way to answer it truly is to re-extract and re-sequence samples and determine how much the community differs.

In larger studies, I would always recommend doing this, e.g. having a smaller set of samples that is re-sequenced on every run. That is excellent as a process control, and it can help guide questions like this.

And back to the error-learning step, am I worried unnecessarily about including bad samples? There are two cases to consider: one where all the samples in a batch are mediocre and some fall below a retention threshold, and the other where most of the samples look good except for a few bad apples. My guess is that including "bad" samples in the first case wouldn't be too harmful, as the errors/poor quality are systemic to that batch, but in the second case my gut would be to exclude bad samples, since the bad quality is sample-specific and probably not informative for the rest of the batch.

By default the learnErrors method will include enough samples to reach 100M sequenced bases. In typical short-read amplicon sequencing to depths of ~100k reads per sample, this will bring in multiple samples, but if you are sequencing deeper (e.g. 1M reads per sample) then this number should perhaps be increased so that the error model isn't built off just one sample.
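
A minimal sketch of raising that limit via the nbases argument of learnErrors() (1e9 is just an illustrative value):

# Default is nbases = 1e8; increase it so deeply sequenced runs draw on more than one sample
errF <- learnErrors(filtFs, nbases = 1e9, multithread = TRUE)
errR <- learnErrors(filtRs, nbases = 1e9, multithread = TRUE)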

That said, I have not experienced any issues with the second scenario you are describing, assuming that the PCR and library prep (and thus the expected error rates) are the same across samples. The one practical thing is that error models tend to fit a bit better in lower-complexity samples. But again, learnErrors is about building a pretty good error model, not a perfect error model.

nickp60 commented 1 year ago

Thanks so much for the insight @elpape! That is reassuring.

And thank you @benjjneb, this is super helpful. We are putting together a big batch for re-sequencing, so I might post a follow-up then if the data shows anything surprising. And your point is well taken: in a pool of tens or hundreds of samples, the reads from the poor samples should be only a small minority in that case.