benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

How to handle Element AVITI data (16S amplicon sequencing) #1893

Closed mniku closed 5 months ago

mniku commented 8 months ago

I'm processing our first Element AVITI 16S amplicon sequencing data, using dada2 in QIIME2. I'm wondering how to do this optimally, as it appears that it behaves a bit differently from MiSeq data in dada2:

The Phred quality stays much higher than it usually does with MiSeq, but to my surprise I still need to truncate the reads just as short to get a similar % accepted in DADA2. For more details & actual statistics, see this thread in the QIIME2 forums.

I first wondered if the AVITI Phred scores are a tad optimistic, but got a potentially interesting comment from an AVITI bioinformatician. They think the scoring should be quite comparable to MiSeq, BUT: "There are differences in the distribution of q scores within a read, however, which would be relevant to DADA2 filtering. AVITI data will have greater q score variance within a read--in AVITI data you would be more likely to find a single low q score in the middle of a high quality read."

Therefore, they recommend we try tweaking maxEE. Does this sound like a good idea? How should we evaluate/validate the results?

We have a huge number of reads to begin with, so that's no problem, and by truncating close to the minimum required overlap we get a comparable % of reads through vs. MiSeq. But it feels stupid to throw away high quality data just because I don't completely understand what's going on.

The QIIME2 folks recommended that I open an issue here, because this calls for a deep understanding of dada2.

benjjneb commented 8 months ago

> "There are differences in the distribution of q scores within a read, however, which would be relevant to DADA2 filtering. AVITI data will have greater q score variance within a read--in AVITI data you would be more likely to find a single low q score in the middle of a high quality read."

That does make sense, assuming it is true. maxEE filters on the expected number of errors in a read. This is different from filtering on the average quality score: quality scores are first transformed into per-base error probabilities (P = 10^-(Q/10)) and then summed, which makes isolated low quality scores much more impactful to a maxEE filter than to an average Q-score filter. There was a paper on this approach that gives more reasoning.
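To make that concrete, here is a minimal Python sketch (not dada2 code) of the arithmetic behind an expected-errors filter, using two hypothetical 10-base reads. The read with the *higher* mean Q actually has ~60x the expected errors, because a single very low quality base dominates the sum:

```python
# Sketch of expected-errors (maxEE-style) filtering vs. mean-Q filtering.
# The quality values here are illustrative, not from any real dataset.

def expected_errors(quals):
    """Sum of per-base error probabilities, P = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in quals)

uniform = [30] * 10        # steady Q30 throughout the read
spiky = [37] * 9 + [2]     # mostly very high Q, with one isolated Q2 base

mean_uniform = sum(uniform) / len(uniform)   # 30.0
mean_spiky = sum(spiky) / len(spiky)         # 33.5 -- higher mean Q!

ee_uniform = expected_errors(uniform)        # 10 * 10^-3 = 0.01
ee_spiky = expected_errors(spiky)            # ~0.63, dominated by the Q2 base
```

So a read like `spiky` sails through a mean-Q filter but contributes far more to expected errors, which is consistent with the AVITI bioinformatician's comment about within-read q-score variance mattering for DADA2 filtering.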

> Therefore, they recommend we try tweaking maxEE. Does this sound like a good idea? How should we evaluate/validate the results?

> We have a huge number of reads to begin with so that's no problem, and by truncating close to minimum required overlap we get comparable % of reads through vs. MiSeq. But it feels stupid to throw away high quality data just because I don't completely understand what's going on.

Throwing away lower quality data is not stupid. At the end of amplicon sequencing, the parameters we are estimating are relative abundances -- the absolute number of reads that gets through the pipeline is not important beyond how it affects the measurement of those relative abundances. So pushing through more, lower-quality data can backfire: more reads make it to the end, but the relative abundance measurements get worse because of the increased noise. Of course there is some optimal level, which will vary among datasets, but in general maximizing the reads passing filtration is not that important. More important is to avoid any systematic loss of data (e.g. of longer amplicons at the merging step).

With that said, when I look at the qza's that you posted in the Q2 thread, I'm more concerned about the large loss at chimera removal, e.g. nearly 50% loss in samples SF1_01 and ZYMO. This is beyond what is expected even for high PCR cycle numbers, and is usually (90%+ of the time) due to unremoved technical bases (primers most often, but sometimes barcodes/adapters) that have ambiguous nucleotides in them.

mniku commented 8 months ago

Many thanks Benjamin!

Yes, the chimera % is indeed high. I have verified that we did remove all the technical bases (using cutadapt with the 341F & 785R primer sequences, and checking by zgrep that this was successful). And the cycling wasn't especially high either (14 cycles in our lab with the 16S primers, and I think 18 cycles in the sequencing core with the technical primers). So I don't quite understand what the problem is. I think we have experienced this previously with MiSeq as well. But we always recover the expected composition of the ZymoBiomics Microbial Community Standard very accurately (which is of course super simplistic).
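One subtlety with a literal zgrep check worth noting: 16S primers contain IUPAC ambiguity codes (e.g. N and W in the commonly used 341F sequence CCTACGGGNGGCWGCAG -- an assumption here, verify against your own protocol), so grepping for the literal primer string will not match reads where those positions vary. A hedged Python sketch that expands the ambiguity codes into a regex before scanning a FASTQ:

```python
# Sketch: estimate the fraction of reads still carrying a primer sequence,
# honoring IUPAC ambiguity codes that a literal grep would miss.
import gzip
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def primer_regex(primer):
    """Expand IUPAC codes (N, W, ...) into a character-class regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

def count_primer_hits(fastq_gz, primer, max_reads=10000):
    """Fraction of the first max_reads reads containing the primer pattern."""
    pattern = primer_regex(primer)
    hits = total = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:               # FASTQ sequence lines
                total += 1
                if pattern.search(line):
                    hits += 1
                if total >= max_reads:
                    break
    return hits / max(total, 1)

# e.g. count_primer_hits("sample_R1.fastq.gz", "CCTACGGGNGGCWGCAG")
# should be near zero after successful cutadapt trimming.
```

This is only a sanity check, not a replacement for cutadapt; but given benjjneb's point about ambiguous nucleotides driving spurious chimera calls, it can catch residual primer that a literal-string grep reports as absent.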

Avi-Til commented 3 months ago

Hi @mniku, we are also comparing MiSeq vs. AVITI datasets from running the same physical library pool on both platforms. We also noticed a significant difference in the number of reads discarded at the chimera removal step in the AVITI run but not the Illumina run. Were you able to identify the cause and a solution? Were there any important observations from tweaking the maxEE value?

I would greatly appreciate your input. Thanks!

To add on: looking at the DADA2 statistics, our AVITI dataset had nearly 85-89% filter pass-through, 42-44% merged reads, and only 23-27% passing chimera removal, which I think differs somewhat from your DADA2 statistics.

mniku commented 3 months ago

Good to hear (I guess) that we are not alone in this! I'm afraid we didn't really SOLVE the issue. We just optimized the truncation carefully so that we got the most out of the data and went with that (there were more than enough accepted reads anyway, as the sequencing was much deeper than with MiSeq). We didn't try tweaking maxEE because we didn't quite know how to validate the results of the tweaking.

IrshadUlHaq1 commented 3 months ago

@mniku The genomic center at our institute proposed that we use their new AVITI technology for amplicon sequencing (16S rRNA gene and the ITS). I thought it was a good proposal because the sequencing depth is a no-brainer. However, I wanted to double-check whether people have been using QIIME2 for AVITI datasets, and on the QIIME2 forum I came across your thread and the potential caveat you mentioned about DADA2. That brought me here, and I just need your opinion on whether I should go for the AVITI chemistry or stick with the MiSeq. You mentioned that you optimized the truncation carefully to retain most of the data. Can you elaborate on that optimized truncation a little bit? I have no preference as long as I reach my goal without any hiccups and deep troubleshooting.

mniku commented 3 months ago

I'd probably go with AVITI if that is more cost effective. I wouldn't say the quality (in terms of DADA2) was inferior to MiSeq; it just seemed worse (in DADA2) than we expected based on the q scores (i.e. more reads were discarded than the very high q scores suggested). I'm guessing this might be, at least in part, because the DADA2 algorithm is optimized for MiSeq (see above; but @benjjneb is of course the expert). I'm hoping this just means we lose a bit more data (which might be kind of OK given the great sequencing depth) and doesn't generate other issues. We always include the ZymoBiomics microbial community composition standard, and that was fine.

By optimizing the truncation, I mean that we tried several different truncation lengths on a subset of the data and selected the settings that gave us the highest numbers of accepted reads. I never really did it this way previously, and only now noticed that it can make quite a big difference (even within the obvious limits set by the minimal overlap and eyeballing of the q scores).
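For anyone repeating this, the "obvious limits set by minimal overlap" reduce to simple arithmetic. A sketch with assumed numbers (a ~427 bp primer-trimmed 341F/785R amplicon and dada2's default 12 nt minimum overlap for mergePairs; substitute the values from your own protocol):

```python
# Sketch: which candidate truncation-length pairs still leave enough
# overlap for read merging? Amplicon length here is an assumption.

def overlap_after_truncation(trunc_f, trunc_r, amplicon_len):
    """Overlap (nt) remaining for merging after truncating F/R reads."""
    return trunc_f + trunc_r - amplicon_len

AMPLICON_LEN = 427   # assumed primer-trimmed 341F/785R length; varies by taxon
MIN_OVERLAP = 12     # dada2 mergePairs default

candidates = [(240, 200), (230, 210), (220, 220), (210, 200)]
for f, r in candidates:
    ov = overlap_after_truncation(f, r, AMPLICON_LEN)
    status = "OK" if ov >= MIN_OVERLAP else "too short to merge"
    print(f"truncLen=({f},{r}) -> {ov} nt overlap: {status}")
```

Within the pairs that satisfy this constraint (the grid search above is purely illustrative), the selection among them can then be driven by which settings pass the most reads, as described above.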

andressamv commented 6 days ago

Following up here, I am curious if we can follow the same steps for merging different runs (before chimera removal) for AVITI and MiSeq data. @benjjneb, do you think we can combine these datasets if the libraries are prepared the same way?

benjjneb commented 4 days ago

If the processed amplicons start/end at the same position, then yes the ASV tables can be combined. I would consider the sequencing technology as a batch effect when evaluating statistical models though.