Closed Andreas-Bio closed 5 months ago
I suspect that you are working with data with binned quality scores, where this sort of error modeling fit pattern has shown up before. The central issue on this topic is #1307
I'd take a look there and see if that helps. Also, a "T2T" error is not an error, its the rate at which a T in the real sequence shows up as a T in the read. The fit there is not something to worry about though -- the fitting is performed entirely on the "errors" and T2T is just the residual after subtracting out the error terms.
Thank you for the quick reply. I do not have binned quality scores. This is 300bp MiSeq forward read data and my amplicon is just 100bp, so I probably do not have a lot of different quality score values.
after primer removal and truncation:
> x@quality@quality %>% as.character %>% strsplit(.,"") %>% unlist %>% table
.
- # * , / : ; ? @ + < = > 0 1 2 3 4 5 6 7 8 9
122 31 14 32111 4 6125 4693 689 34608 4102 24477 101 461 3 3 4 5 28 18 16730 4806 6791 8689
A B C D E F G
10654 4251 76183 28877 77606 241158 3717236
The dip at ~Q35 doesn't look ideal, and hopefully progress on the binned quality score error fitting issue can help here as well, as the very large numbers of certain quality scores, and the very small numbers at other scores, are what is causing this and is very similar to the challenge with binned quality scores.
More data in the error model fitting may help. But, I probably would just go with these error models. The goal with DADA2 error modeling is to get a pretty good error model, not a perfect error model.
Hi,
Most of your lower quality calls are from beyond the amplicon length.
Illumina has this weird setting that it keeps calling positions even though there is nothing to be sequenced anymore. I think those read through positions are causing issues with the error models. Basically they are indels which are not modelled.
Try truncating the reads by length.
Cheers
On Sat, May 22, 2021, 16:59 andzandz11 @.***> wrote:
Thank you for the quick reply. I do not have binned quality scores. This is 300bp MiSeq forward read data and my amplicon is just 100bp, so I probably do not have a lot of different quality score values. [image: grafik] https://user-images.githubusercontent.com/19622117/119231011-3cdd3b80-bb0e-11eb-853d-1ed87cc08338.png
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/1343#issuecomment-846420018, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE4JZDVEGIXIZZWEFD7PHOLTO7BEFANCNFSM45JJ6MNQ .
Thank you for your feedback, but I discarded all reads without rev+forw primers. Then I cut those primers. I can be sure to not have any "overhang" quality scores in my data.
Are you sure? Your quality profile suggests that some reads have 300bp length.
On Tue, May 25, 2021, 01:20 andzandz11 @.***> wrote:
Thank you for your feedback, but I discarded all reads without rev+forw primers. Then I cut those primers. I can be sure to not have any "overhang" quality scores in my data.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/1343#issuecomment-847416384, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE4JZDTABZM3UBXV4JTYHHDTPLNLDANCNFSM45JJ6MNQ .
Sorry I should have been more clear. The quality profile is after adapter removal but otherwise the reads are untouched. I just wanted to show what my data looks like at the beginning of the pipeline. This is right before I use learnErrors:
So maybe the problem is that some of the rare species have longer amplicons than the common species?
@andzandz11 what gene/region are you sequencing? It looks like you have a tiny fraction of reads that survived primer removal and still have a length of ~300 bp.
I'm with @cresil that you should consider truncating at/around 100 bp (or whatever length that would be including primer/adapter) if you aren't expecting any sequences significantly longer than that. Those sequencing artifacts can throw off model fitting.
I think my error model is not converging. I included all the reads I possibly could in the error model calculation. It has very weird black lines that move far away from the sampled dots (there is a dip around 35). Any idea why this is happening?
Also what is an T2T error? Shouldn't it be 1 all the time?
Thanks!