benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
471 stars 143 forks source link

Error model fit #1343

Closed Andreas-Bio closed 5 months ago

Andreas-Bio commented 3 years ago

grafik

I think my error model is not converging. I included all the reads I possibly could in the error model calculation. It has very weird black lines that move far away from the sampled dots (there is a dip around 35). Any idea why this is happening?

Also what is an T2T error? Shouldn't it be 1 all the time?


> errF <- learnErrors(inputfiles, multithread = T, randomize = T, nbases = 1e20, MAX_CONSIST = 100)
404252934 total bases in 7873078 reads from 180 samples will be used for learning the error rates.
> plotErrors(errF, nominalQ=TRUE)
Warnings::
1: Transformation introduced infinite values in continuous y-axis 
2: Transformation introduced infinite values in continuous y-axis 

Thanks!

benjjneb commented 3 years ago

I suspect that you are working with data with binned quality scores, where this sort of error modeling fit pattern has shown up before. The central issue on this topic is #1307

I'd take a look there and see if that helps. Also, a "T2T" error is not an error, its the rate at which a T in the real sequence shows up as a T in the read. The fit there is not something to worry about though -- the fitting is performed entirely on the "errors" and T2T is just the residual after subtracting out the error terms.

Andreas-Bio commented 3 years ago

Thank you for the quick reply. I do not have binned quality scores. This is 300bp MiSeq forward read data and my amplicon is just 100bp, so I probably do not have a lot of different quality score values. grafik

after primer removal and truncation:

> x@quality@quality %>% as.character %>% strsplit(.,"") %>% unlist %>% table
.
      -       #       *       ,       /       :       ;       ?       @       +       <       =       >       0       1       2       3       4       5       6       7       8       9 
    122      31      14   32111       4    6125    4693     689   34608    4102   24477     101     461       3       3       4       5      28      18   16730    4806    6791    8689 
      A       B       C       D       E       F       G 
  10654    4251   76183   28877   77606  241158 3717236 
benjjneb commented 3 years ago

The dip at ~Q35 doesn't look ideal, and hopefully progress on the binned quality score error fitting issue can help here as well, as the very large numbers of certain quality scores, and the very small numbers at other scores, are what is causing this and is very similar to the challenge with binned quality scores.

More data in the error model fitting may help. But, I probably would just go with these error models. The goal with DADA2 error modeling is to get a pretty good error model, not a perfect error model.

cresil commented 3 years ago

Hi,

Most of your lower quality calls are from beyond the amplicon length.

Illumina has this weird setting that it keeps calling positions even though there is nothing to be sequenced anymore. I think those read through positions are causing issues with the error models. Basically they are indels which are not modelled.

Try truncating the reads by length.

Cheers

On Sat, May 22, 2021, 16:59 andzandz11 @.***> wrote:

Thank you for the quick reply. I do not have binned quality scores. This is 300bp MiSeq forward read data and my amplicon is just 100bp, so I probably do not have a lot of different quality score values. [image: grafik] https://user-images.githubusercontent.com/19622117/119231011-3cdd3b80-bb0e-11eb-853d-1ed87cc08338.png

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/1343#issuecomment-846420018, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE4JZDVEGIXIZZWEFD7PHOLTO7BEFANCNFSM45JJ6MNQ .

Andreas-Bio commented 3 years ago

Thank you for your feedback, but I discarded all reads without rev+forw primers. Then I cut those primers. I can be sure to not have any "overhang" quality scores in my data.

cresil commented 3 years ago

Are you sure? Your quality profile suggests that some reads have 300bp length.

On Tue, May 25, 2021, 01:20 andzandz11 @.***> wrote:

Thank you for your feedback, but I discarded all reads without rev+forw primers. Then I cut those primers. I can be sure to not have any "overhang" quality scores in my data.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/1343#issuecomment-847416384, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE4JZDTABZM3UBXV4JTYHHDTPLNLDANCNFSM45JJ6MNQ .

Andreas-Bio commented 3 years ago

Sorry I should have been more clear. The quality profile is after adapter removal but otherwise the reads are untouched. I just wanted to show what my data looks like at the beginning of the pipeline. This is right before I use learnErrors: grafik

So maybe the problem is that some of the rare species have longer amplicons than the common species?

JacobRPrice commented 3 years ago

@andzandz11 what gene/region are you sequencing? It looks like you have a tiny fraction of reads that survived primer removal and still have a length of ~300 bp.

I'm with @cresil that you should consider truncating at/around 100 bp (or whatever length that would be including primer/adapter) if you aren't expecting any sequences significantly longer than that. Those sequencing artifacts can throw off model fitting.