learnError() gives an ascending error model

EnigmaSalome commented 4 years ago

Hello,

First off, thank you for the very descriptive tutorial on dada2, it is amazing.

I am currently faced with a weird error plot, which I don't quite understand, so any help or insight will be appreciated!

I have 90 human fecal samples that were sequenced with an Ion PGM platform (single-end), coming from the paper "Limited prolonged effects of rifaximin treatment on irritable bowel syndrome-related differences in the fecal microbiome and metabolome" (Gut Microbes, 2016).

The read quality is overall okay, the quality score staying around 30 (decreases a bit at the end of the reads), and the reads length is about 500bp. I filtered the reads with the following parameters:

trimLeft = 15
maxEE = 2
truncQ = 10
minLen = 200

Then I used the learnErrors() function with multithread = TRUE and randomize = TRUE. And weirdly, on the plot, the error frequency doesn't decrease for higher consensus quality scores, it looks rather the opposite. I can't understand it.

Default_nbases_plot

The error rates were estimated from 16 samples, and converged after 6 self-consistency steps:

Default_nbases_Console

And when you look at the output dada2:::checkConvergence(err), it looks normal (if I understand correctly what is written on the dada2 FAQ):

Default_nbases_check

I tried increasing the nbases parameter to 1e10, and then the error rates are estimated from the 90 samples and converge in 7 self-consistency steps, resulting in almost the same plot:

High_nbases_plot

I also tried being more stringent on the filtering, using maxEE = 1 (but then I lose 98% of my reads), which somewhat improves a little bit the error rates plot (looks more like it's decreasing a little):

maxEEFilt_nbases_plot

So here it is, I don't quite understand what this means, and I don't feel very confident trying to infer sequence variants from this.

Thank you in advance for any insight you may have!

Cheers, Salomé

=> Just some more information that may be relevant: The authors used the Ion 16S Metagenomics Kit, which provides primers to amplify 6 regions (V2, V3, V4, V6-7, V8, V9). The primer sequences in the kit are proprietary, so we don't know them, and in each sample all of these amplified regions are mixed. There is a QIIME forum where they discuss this issue and advise to use trimLeft = 15 as parameter in the filtering.

benjjneb commented 4 years ago

Huh. I've never seen a fitted error model that looks like this before, ever. I do have very little experience with Ion Torrent data though, and none with the 16S Metagenomics Kit you are using though.

That said, it seems like the error model (black line) is well-matching the observations (points) over most of the range of quality scores, it is just at the tail of high quality scores where it is diverging. My guess is that because there are relatively few observations of those quality scores, and thus the weighted loess fitting that is being done is putting too low a weight on matching those points.

As weird as it looks, it is probably usable as is. If I was being picky, I might force the error-rates at the high-quality parts to something like 0.01 by hand. This is hacky, but will make the whole error model better fit over the full range of quality scores. The error model is just a matrix, so by-hand changes are straightforward to perform.

EnigmaSalome commented 4 years ago

Thank you very much for the quick response!

It makes sense, looking at the quality profiles, there are few reads with a quality score above 30, so that can explain why the loess fitting is not as good for the high quality scores.

I will try to modify the error rates and see how it influences the downstream pipeline. Why is it important that the error model should fit better the observations? For the dada() function, it uses only the error model (and not the observations)? Just asking to make sure I understand.

benjjneb commented 4 years ago

For the dada() function, it uses only the error model (and not the observations)?

Correct.

EnigmaSalome commented 4 years ago

So I tried to change a little bit the error rates so the model better fits the observations:

err_modified

It doesn't seem to affect much the dowstream pipeline. I just have 16 extra sequence variants inferred (out of 12,000) with this manually modified error rate model (compared to the initial error model that looked weird). And then relative abundances at the Phylum level look the same. I'll go further in the data analysis and post something if there are some abnormalities, but it doesn't seem like it so far.

Thanks again for the help!

mmp3 commented 3 years ago

For what it's worth, I am working with a different dataset that was also generated from the Ion Torrent PGM platform and from the Ion 16 Metagenomics Kit from Thermo Fisher, and my error rate profiles are qualitatively very similar as those reported by @EnigmaSalome. In particular, the error rates are ascending for Q > 30 for both the observed data (points) and fitted model (black line).

The suggestion by @benjjneb and confirmation by @EnigmaSalome that there are relatively few bases with Q > 30 is an excellent point.

Perhaps this is just an inherent property of the Ion Torrent PGM platform or its output for libraries from the Ion 16S Metagenomics Kit. Or perhaps it is just an artifact of the paucity of bases with Q > 30, as suggested.

benjjneb / dada2

learnError() gives an ascending error model #1019