benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
470 stars 142 forks source link

V4 150bp x 2 filtering issue #140

Closed crfitzpat closed 8 years ago

crfitzpat commented 8 years ago

I amplified the 16S V4 region and sequenced using 150bp x 2 reads. I'm using standard filtering options:

fastqPairedFilter(c(fnFs[i], fnRs[i]), c(filtFs[i], filtRs[i]), trimLeft=c(10, 10), truncLen=c(140,140), maxN=0, maxEE=2, truncQ=2, compress=TRUE, verbose=TRUE)

I have 4 MiSeq runs total. 3/4 runs exhibit normal behaviour throughout the pipeline i.e. expected number of reads passing filter and merged after dada. One run is exhibiting some odd behaviour, only ~20% of reads are passing the filtering step and far fewer are merging (relative to my other runs). After fiddling around with the filtering parameters I discovered that the reduction in filter passing was caused exclusively by the reverse read and if I remove the trunLen and trimLeft arguments I see ~90% pass the filter (this is about what I see for the rest of my runs).

I think this means the read length, and not the quality in my reverse reads, for this particular MiSeq run, are variable and this is causing the filter step to remove ~80% of my reads. Though when I change the truncLen argument to be more inclusive I don't recover any more through the filter.

Any thoughts to diagnose the problem with the reverse reads from this run would be greatly appreciated!

DADA2 is a wonderful thing, btw.

benjjneb commented 8 years ago

Glad DADA2 has been useful for you!

The easiest way to diagnose this would be if you could share a portion of the unusual run, for example the unfiltered F/R fastq file of one sample. Is that possible? My email is benjamin DOT j DOT callahan AT gmail DOT com if the files are small enough.

Also, could you post the exact modified filtering command that gave the different results on the unusual run?

crfitzpat commented 8 years ago

Thanks for the speedy reply!

So my initial filtering is:

fastqPairedFilter(c(fnFs[1], fnRs[1]), c(filtFs[1], filtRs[1]), trimLeft=c(10, 10), truncLen=c(140,140), maxN=0, maxEE=2, truncQ=2, compress=TRUE, verbose=TRUE)

Which yields about 20% of the total reads

When I do this:

fastqFilter(fnFs[1], filtFs[1], trimLeft=10, truncLen=140, maxEE=2, truncQ=2, maxN=0, compress=TRUE, verbose=TRUE)

I get 90% of the forward reads passing filter. but the exact same function with the reverse reads yields 20% reads passing filter. This tells me it is the reverse reads causing the problem.

When I do this:

fastqFilter(fnRs[1], filtRs[1], maxEE=2, truncQ=2, maxN=0, compress=TRUE, verbose=TRUE)

I recover 90% of the reverse reads. This tells me that maybe the length threshold is what is causing these reverse reads to be filtered out.

The weird thing is that this only happens for my second run which makes me think either something went wrong with the amplifications for these samples or something wen wrong during the sequencing of the second read from these samples.

I've sent a link to the unprocessed samples. Thanks for the help!

fanli-gcb commented 8 years ago

Anything weird in the run metrics? Unless you're using an unusual amplicon strategy, there shouldn't be any variability in the read lengths.

benjjneb commented 8 years ago

The unusual thing in this run is that a large fraction of the reverse reads (~80%) have at least one base at the start that has quality score 2 (see big drop down at the start):

image

This runs afoul of the truncQ=2 default value. That parameter truncates each read at the first position with a quality score of 2 or less. This allows all the reads to pass the filter when turning off truncLen as no length checking is performed, but most of the reads are just a single nucleotide long.

The fix here is to set truncQ=0 (or perhaps even better truncQ=c(2,0)) in the filter function call for this run.

I do have to say this behavior is somewhat confusing, because the trimLeft=10 would seemingly remove this problematic base. However, truncQ is enforced before trimLeft is applied. I may change the order of those operations.

fanli-gcb commented 8 years ago

It might be worth checking the thumbnail images on the sequencer (or asking your core to do the same) to see what happened that cycle. They may owe you a free run...

crfitzpat commented 8 years ago

Thanks for looking into this closely! This explanation makes sense given the order of operations. I'll see how the analysis works with truncQ=0 for the reverse reads.

@fanli-gcb - The run metrics looked great, excellent Q score, optimal cluster density, good distribution across my samples. Maybe this cycle hit on a particularly low diversity locus across V4... though we spiked with 5% PhiX so this shouldn't be a problem. I'll check with the sequencing centre about the thumbnail image, haven't heard of this check before.

Thanks again! And if you ever decide to make a DADA2 t-shirt design I'll buy heaps!

fanli-gcb commented 8 years ago

@crfitzpat - Are you using 515F/806R primers for V4? We've done tons of amplicon sequencing (V1V2, V4, ITS) and never had that particular issue. That being said, I do remember one run where we did get a single base (2nd base in the index read, oddly enough) that had really low Q scores and ended up read as N. I forget exactly what happened, but Illumina tech support is good about replacing reagents when things like this happen. Do you see the same valley in Q score for other samples as well? Or the run in general?

@benjjneb - my thanks as well for DADA2! We were originally waiting for the qiime2-dada2 module to drop but finally caved :)

benjjneb commented 8 years ago

trimLeft is now enforced before truncQ which solves this edge case of very low Q scores in the first few bases that are being trimmed off: 94fbe215242f80fc95b76b04b8229f0064e7ea7c