Analysis of MiSeq and iSeq fastq files using DADA2

ong8181 commented 4 years ago

Hi DADA2 developers,

I have been using MiSeq so far, but my group recently bought an iSeq and we are trying to analyze iSeq sequence data with DADA2. The iSeq generates basically the same outputs as the MiSeq, but I found that the quality scores (Q-scores) are very different: MiSeq fastq files contain Q-scores of 0-39, whereas iSeq fastq files contain only three Q-scores (11, 25, 37).

DADA2 can run with the iSeq fastq files, but I am wondering whether analyzing iSeq data using DADA2 is appropriate or not. To briefly examine the effects of the different Q-scores, I have performed several analyses using my own sequence data (scripts and results are a bit long, so I posted them in my Github repository: https://github.com/ong8181/random-scripts/tree/master/04_MiSeq_vs_iSeq_DADA2)

General procedure of my test is as follows:

Partial 16S rRNA sequences were amplified using 515F-806R, and the amplicons were sequenced by MiSeq V2 250 x 2 bp kit.
Started from MiSeq fastq files (Q-scores 0-39).
Manually converted MiSeq Q-scores to iSeq Q-scores using a shell script.
These two types of fastq files were analyzed identically using DADA2.
Representative sequences were saved as "ASV.fa", and taxa information was assigned.
ASV table, sample information, and taxa information were imported as phyloseq objects.
Three types of visualizations were done: barplots of MiSeq-style and iSeq-style fastq files, sequence reads of MiSeq-style vs. iSeq-style fastq files, and relative abundance of MiSeq-style vs. iSeq-style fastq files.
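
For readers who want to reproduce the Q-score conversion step, here is a minimal Python sketch of the idea. It is not the shell script from the repository. The three output values (11, 25, 37) come from the post above, but the cutoffs used to assign them (Q >= 30 and Q >= 15) are assumptions chosen only for illustration and may not match the instrument's real binning. The function names are hypothetical, and Phred+33 encoding is assumed.

```python
# Sketch: collapse full-range Q-scores into three bins (11, 25, 37).
# ASSUMPTION: the cutoffs below are illustrative, not an official mapping.
ASSUMED_CUTOFFS = ((30, 37), (15, 25))  # (minimum Q, binned Q), highest first
ASSUMED_FLOOR = 11                      # binned value for everything lower


def bin_quality(q):
    """Map a single Phred score to its (assumed) binned value."""
    for threshold, binned in ASSUMED_CUTOFFS:
        if q >= threshold:
            return binned
    return ASSUMED_FLOOR


def bin_quality_string(qual, offset=33):
    """Re-encode a Phred+33 quality string using the binned scores."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


def bin_fastq_lines(lines):
    """Rewrite only the quality line (every 4th line) of 4-line fastq records."""
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        out.append(bin_quality_string(line) if i % 4 == 3 else line)
    return out


record = ["@read1", "ACG", "+", "I5#"]  # toy record with Q = 40, 20, 2
binned_record = bin_fastq_lines(record)
```

Only the quality line of each record changes; headers and sequences pass through untouched, so the two fastq sets differ in nothing but Q-score resolution.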

I guess that, based on the results of my analysis and the algorithm of DADA2, analyzing iSeq data should be fine, but I would be glad if you could give me your thoughts on this issue.

Best regards, Ushio

benjjneb commented 4 years ago

Wow, awesome set of analyses and Github repository, thanks for that work!

Based on what you see there, I think it confirms what I expect, which is that DADA2 will largely work OK with iSeq-type quality scores. That said, there is some concern that denoising error rates might be moderately higher in iSeq data; in particular, there might be a higher number of false-positive rare ASVs. This is for two reasons: first, the binned quality scores carry less information, which makes accurate denoising more difficult; second, DADA2's error model fitting procedure was built for "normal" MiSeq quality score distributions and can be non-ideal for binned quality scores. This has been discussed before, and there is quite a bit of useful information in some other threads on this issue: https://github.com/benjjneb/dada2/issues/791
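
To make the "less information" point concrete, here is a small Python sketch using the standard Phred definition, P = 10^(-Q/10). It shows only the nominal error probability attached to each of the three binned values. DADA2 estimates its own error rates from the data, so these numbers are context, not what the error model will report. The function name is hypothetical.

```python
# Sketch: nominal error probability implied by a Phred quality score.
# This is the textbook Phred definition, not DADA2's fitted error model.


def phred_to_error_prob(q):
    """Return the nominal per-base error probability for Phred score q."""
    return 10 ** (-q / 10)


# The three Q-scores reported for iSeq fastq files in this thread.
binned_scores = [11, 25, 37]
nominal = {q: phred_to_error_prob(q) for q in binned_scores}

for q, p in nominal.items():
    print(f"Q={q}: nominal error probability {p:.2e}")
```

With only three distinct values, every base in a bin shares one nominal error probability, whereas unbinned scores distinguish many more levels across the same range.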

I do think two additional simple diagnostics could be useful: what does the output of plotErrors look like in the iSeq-type data? And what is the histogram of ASV abundances in both datasets (i.e. are there more rare ASVs in the iSeq-type data)?
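
The rare-ASV check could be sketched roughly as follows in Python, assuming the ASV table has already been reduced to total read counts per ASV. The function name, the `rare_max` threshold of 10 reads, and the toy counts are all hypothetical and chosen only for illustration; they are not values from this thread.

```python
# Sketch: summarize ASV abundances to compare two denoising runs.
# ASSUMPTION: "rare" means at most rare_max reads; the cutoff is arbitrary.
import math
from collections import Counter


def abundance_summary(asv_counts, rare_max=10):
    """Count observed ASVs, rare ASVs, and ASVs per order of magnitude."""
    observed = [c for c in asv_counts.values() if c > 0]
    hist = Counter(int(math.log10(c)) for c in observed)
    return {
        "n_asvs": len(observed),
        "n_rare": sum(1 for c in observed if c <= rare_max),
        # key k holds ASVs with 10**k <= reads < 10**(k + 1)
        "log10_hist": dict(sorted(hist.items())),
    }


# Toy counts, invented purely to show the output shape.
toy_counts = {"ASV1": 5000, "ASV2": 320, "ASV3": 8, "ASV4": 2}
summary = abundance_summary(toy_counts)
```

Running the same summary on the MiSeq-style and iSeq-style tables and comparing `n_rare` and the lowest histogram bins would give a quick numeric counterpart to the histograms.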

ong8181 commented 4 years ago

Thank you so much for your reply.

The outputs of plotErrors are as follows (these are also available at "03_SeqAnalysisDADA2_xxxOut" in the repository):

[Figure: MiSeq error plot]

[Figure: iSeq error plot]

As in #791, estimated error rates decrease sharply at around Q=30-35.

Also, I have checked histograms of ASV read counts as well as ASV relative abundance.

There is no big difference between MiSeq and (simulated) iSeq data. Slightly more rare ASVs are found in the MiSeq data in terms of relative abundance (bottom panel), but this is probably because of the greater read counts of relatively abundant taxa derived from the MiSeq data (top panel). Analyzing iSeq data with DADA2 looks fine at least when we are interested in obtaining general overview of microbial communities.

benjjneb commented 4 years ago

> Analyzing iSeq data with DADA2 looks fine at least when we are interested in obtaining general overview of microbial communities.

Yeah, given the analyses you've shown here, I feel pretty good about that conclusion as well.

Analysis of MiSeq and iSeq fastq files using DADA2 #1083