benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Negative Quality Scores #968

Closed NkaziN closed 4 years ago

NkaziN commented 4 years ago

Not sure whether anyone else is experiencing this problem, but I'm wondering how Dada2 determines the quality scores.

I'm currently running a demultiplexed sample with several different fastq files. All of them were produced in the same sequencing run, but there's a discrepancy in the quality scores, similar to what is shown below for D7.

[Image: quality scores for sample D7]

Almost all samples have similar quality scores in the 40-60 range. However, a few samples, like D7, have quality scores that dip below 0. I've manually set the quality score encoding to Phred+33 for all samples, and when I open the fastq files for D7 and the adjacent samples, they look almost identical (though D7 has fewer reads). I have also tried the solution in https://github.com/benjjneb/dada2/issues/838, with no success.

Since there are negative quality scores, DADA2 won't let me process the samples.

Does anyone know:

Thank you!

benjjneb commented 4 years ago

What sequencing instrument produced these fastq files?

> I'm wondering how Dada2 determines the quality scores.

See discussion here: https://github.com/benjjneb/dada2/issues/682

In short, it tries to automatically detect the encoding as base 33 or base 64 (i.e. an ASCII offset of 33 or 64, also written Phred+33 and Phred+64), but this can be overridden by providing the qualityType="FastqQuality" (base 33) or qualityType="SFastqQuality" (base 64) argument.
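To see why a wrong encoding guess produces negative scores, here is a minimal base-R sketch. It is only an illustration of the offset arithmetic, not dada2's actual detection logic; the helper `decode_quality` and the quality string are hypothetical.

```r
# Decode a FASTQ quality string under a given ASCII offset (33 or 64).
# Hypothetical helper for illustration; not part of dada2 or ShortRead.
decode_quality <- function(qual_string, offset) {
  utf8ToInt(qual_string) - offset
}

# Made-up quality line as it might appear in a Phred+33 file.
qual <- "IIIIHHGF5+#"

# Read with the correct offset: every score is non-negative.
phred33 <- decode_quality(qual, 33)

# Same bytes misread as Phred+64: any character below '@' (ASCII 64)
# decodes to a negative score.
phred64 <- decode_quality(qual, 64)

print(phred33)
print(phred64)
```

If a Phred+33 file is misdetected as base 64, every score shifts down by 31, which is consistent with the negative values reported in this thread.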

NkaziN commented 4 years ago

All the reads are from an Illumina MiSeq. I've manually set the quality type to FastqQuality, but I'm still noticing the same trend.

I just opened one of the files with negative quality scores and noticed that it only contained three sequences. Could a low sequence count be the problem?

benjjneb commented 4 years ago

> Could a low sequence count be the problem?

No, that shouldn't have any effect. Can you share an example file that is resulting in negative quality scores? The smaller the better.

NkaziN commented 4 years ago

I've put together this sample folder to illustrate the issue. Running Dada2 on the "demultiplexed" folder results in the error.

The Rplots pdf shows the apparent drop in read quality for B9 and a few other samples.

benjjneb commented 4 years ago

You'll need to explicitly dereplicate these files and define the qualityType in that step. With that change it works fine in my testing:

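library(dada2)
# Filtered fastq files, one per sample
filt <- list.files("filtered", pattern=".fastq", full.names=TRUE)
# Dereplicate with the base 33 encoding set explicitly rather than auto-detected
drp <- derepFastq(filt, qualityType="FastqQuality")
dd <- dada(drp, selfConsist=TRUE)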

The issue is that some of these samples contain a tiny number of reads, and the automatic detection of the quality score encoding is failing on them. This brings up another point though: dada2 is not really appropriate to use on samples with <100 reads, so you should consider that larger point as well. I'd also strongly suggest pooling samples together for better sensitivity, given the small library sizes in these samples.
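As a rough way to spot the samples affected by the <100 reads caveat before running dada2, the following base-R sketch counts reads per file. It assumes each fastq record spans exactly four lines; the helper names `count_fastq_reads` and `flag_small_samples` are hypothetical, and the 100-read default simply mirrors the threshold mentioned above.

```r
# Count reads in a FASTQ file, assuming exactly four lines per record.
# Hypothetical helper; not part of dada2.
count_fastq_reads <- function(path) {
  length(readLines(path)) %/% 4L
}

# Return the read counts of files that fall below a minimum read count.
flag_small_samples <- function(paths, min_reads = 100L) {
  counts <- vapply(paths, count_fastq_reads, integer(1))
  counts[counts < min_reads]
}

# Tiny demonstration on a temporary one-read file.
tmp <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT", "+", "IIII"), tmp)
print(flag_small_samples(tmp))
```

In practice the paths would come from something like list.files("filtered", pattern=".fastq", full.names=TRUE). For the pooling suggestion, dada() accepts a pool=TRUE argument that pools all samples for inference.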