benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
460 stars 141 forks source link

Quality score issue for Pabbio Sequel I data #906

Closed Wei201910 closed 4 years ago

Wei201910 commented 4 years ago

I recently read your paper titled “High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Research, 2019.” I am currently applying your dada2 workflow to our PacBio Sequel I data. However I couldn’t run through ‘learnErrors’ step due to not enough unique reads from ‘derepFastq’ step.

I also tried the datasets from NCBI SRA and it works fine. I then checked the difference between NCBI data and our data and find out your data has unique quality scores for each base. But our Pacbio data doesn’t have these type of scores (shown as below). I tried re-generating fastq files using SMRT Link 7.0 but still didn’t get the quality scores as you mentioned in your paper.

May I ask you how you generate fastq files with quality scores for your NCBI SRA datasets?

Best Regards,

@SRR9089357.70 70 length=1569 GAGTGCTACTCTAGTAGGTTACCTTGTTACGACTTCA +SRR9089357.70 70 length=1569 m~~~q~~~~~~~~

m54042_171220_112913/4588425/ccs 4 0 255 * 0 0
AGGGTTTGATTATGGCTCAGATTGAACGCTGGCGGC 555555555555555555555555555555555555555

benjjneb commented 4 years ago

Usually this is because the folks who deposited this data incorrectly trimmed it and produced invalid fastq files. We've seen this before, where the reads are trimmed, but the quality scores were not trimmed in the same way, and I'm seeing the same thing in your example fastq lines: The sequence data is longer than the quality score data.

I'm not sure how to fix it. One could by hand trim the quality scores to the same lengths, but the trimming/truncation on the sequences may not have been entirely on the tails, so that could make for incorrect correspondence between quality scores and bases.

Wei201910 commented 4 years ago

Thanks for your help. The fastq reads I provided here are only copy/pasted samples to show the quality score coding difference. My question is how i could generate dada2 compatible reads with correct quality score instead of '5555'. I tried using 'qualityType' to pass fastq qualaity to dada2 function. However, it still doesn't show the right quality score coding.

benjjneb commented 4 years ago

Ah I think I misunderstood your question.

Yes, fastq is a poor "format" in that it is non-standard, in particular the encoding of the quality scores can be ASCII+33 or ASCII+64 depending on the software used. DADA2 relies on the ShortRead package to read in fastq files, which tries to autodetect the quality encoding. However, this can go wrong, particularly when there is very little diversity in quality scores. You can provide the quality encoding yourself with the qualityType command, as you noticed. I think you just need to try the two options ("FastqQuality" and "SFastqQuality") and find the one that is working correctly.

Note that the CCS command will generate quality scores up to 93.

Wei201910 commented 4 years ago

Thanks, I tried qualityType command in dada2, but i am not sure if ShortRead package successfully detected the quality score encoding for Pacbio sequel ccs file. Because the quality scores are the same as the scores in ccs file. The ccs command to generated ccs format file with the quality score '55555555'. It seems that these score is a bit low for dada2 to process. I am also wondering if dada2 has any limitations on Pacbio read quality score? For example, if the read quality score is too low and dada2 is not suitable for analysis these type of data?

benjjneb commented 4 years ago

55555555 does not mean the quality scores is "5". The quality scores is ASCII encoded, and so it corresponds to the ASCII code for the character 5 (which is 53) minus the offset used (either 33 or 64). Pacbio uses ASCII-33 encoding, so 5 in the CCS fastq file is equal to a quality score of 53-33 = 20.

Wei201910 commented 4 years ago

Thanks you very much, it might not be the problem of quality score in order to run through dada2 workflow. I think it could be the reason of not enough unique reads to estimate error model using dada2 algorithm. Or our data doesn't need to use dada2 to denoise errors from PCR steps?