benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Loss of many reads in Reverse and merger steps #1464

Closed: sarah-mangum closed this issue 1 month ago

sarah-mangum commented 2 years ago

Hello @benjjneb,

Thanks for dada2! We are following your ITS pipeline but are having some unexpected read loss. We lose a large majority of our fungal reads, and we consistently have quality issues with our R2 reads. Interestingly, we have noticed that we consistently lose reads from common genera. We have been testing with known mock communities as well as our experimental communities. We have attached the raw fastq files from an ATCC mycobiome mix, along with some experimental samples, as an example showing loss of the Fusarium genus in addition to the general read loss described above.

Sequencing Parameters: V2 2X250 MiSeq

Cutadapt (command assembled as a Python string):

```python
'cutadapt -a GCATATCAATAAGCGGAGGA -A "GCTGCGTTCTTCATCGATGC;max_error_rate=0.35" --no-indels -o '+output_path+sample_name+'R1_001.fastq -p '+output_path+sample_name+'R2_001.fastq '+path+sample_name+'R1_001.fastq '+path+sample_name+'R2_001.fastq'
```

filterAndTrim parameters:

```r
out <- filterAndTrim(fwd = fnFs, filt = filtFs, rev = fnRs, filt.rev = filtRs,
                     truncQ = 8, maxN = 0, maxEE = c(8, 8),
                     rm.phix = TRUE, compress = TRUE, verbose = TRUE)
```

Raw fastq attachments:

- ATCCmycobiomemix.txt
- RC3335953-MSITS3_S21_L001_R2_001.zip
- experimentalsample1-MSITS3_S30_L001_R1_001.fastq.gz, experimentalsample1-MSITS3_S30_L001_R2_001.fastq.gz
- experimentalsample2-MSITS3_S108_L001_R1_001.fastq.gz, experimentalsample2-MSITS3_S108_L001_R2_001.fastq.gz
- experimentalsample3-MSITS3_S52_L001_R1_001.fastq.gz, experimentalsample3-MSITS3_S52_L001_R2_001.fastq.gz
- experimentalsample4-MSITS3_S101_L001_R1_001.fastq.gz, experimentalsample4-MSITS3_S101_L001_R2_001.fastq.gz
- experimentalsample5-MSITS3_S14_L001_R1_001.fastq.gz, experimentalsample5-MSITS3_S14_L001_R2_001.fastq.gz

benjjneb commented 2 years ago

Do you know the lengths of the sequenced amplicon from each of the strains you are looking at?

ITS amplicons have much more variable lengths than other common amplicon-sequencing targets. This can cause a systematic dropout of taxa with long ITS regions if the reads don't sufficiently overlap to be merged. I would look to see if that is what might be happening in your mock community at least.
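As a quick sanity check, that arithmetic can be sketched in R. The amplicon lengths below are hypothetical, and this ignores any length lost to quality truncation; dada2's `mergePairs` requires at least `minOverlap = 12` matching bases by default:

```r
# Sketch: can a given amplicon length be merged from 2x250 reads?
# (Hypothetical lengths; truncation during filtering shortens reads further.)
read_len <- 250
min_overlap <- 12                              # mergePairs default
max_mergeable <- 2 * read_len - min_overlap    # 488 bp, before any truncation
amplicon_lengths <- c(280, 350, 450, 520)      # hypothetical ITS amplicon lengths
amplicon_lengths <= max_mergeable
#> TRUE TRUE TRUE FALSE
```

Any taxon whose amplicon exceeds that threshold will silently drop out at the merge step, which looks like taxon-specific read loss.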

A solution is to just use the forward reads, as this skips the problematic merging step.
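A forward-only run looks like the usual workflow with the reverse-read and merge steps dropped. This is a minimal sketch; object names like `filtFs` follow the dada2 tutorial and are assumptions about your workspace:

```r
library(dada2)

# Forward reads only: learn errors, denoise, and tabulate directly,
# skipping dada() on the reverse reads and mergePairs() entirely.
errF <- learnErrors(filtFs, multithread = TRUE)
ddF  <- dada(filtFs, err = errF, multithread = TRUE)
seqtab <- makeSequenceTable(ddF)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
```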

sarah-mangum commented 2 years ago

Thanks for the quick reply, @benjjneb! We don't know the exact lengths of all the strains we are looking at, but on average they vary from the upper 200s to the upper 400s in base pairs. We sequence on a V2 nano run at 2x250, so we range from almost completely overlapping reads to reads with much smaller overlaps.

We do see many samples merge successfully within a run, but we're not sure why some seem to lose all or almost all of their reads, typically at the denoisedR step. For example, with the files above, here are the log reports:

| sample | readsIn | filtered | denoisedF | denoisedR | merged | tabled |
|---|---|---|---|---|---|---|
| experimental1 | 1323 | 1276 | 1273 | 247 | 240 | 240 |
| experimental2 | 3163 | 3125 | 3122 | 1236 | 1236 | 1236 |
| experimental3 | 2760 | 2730 | 2728 | 1 | 0 | 0 |
| experimental4 | 4122 | 4071 | 3951 | 1156 | 1156 | 1156 |
| experimental5 | 4567 | 4442 | 4403 | 170 | 164 | 164 |
| RC3335953 | 1339 | 1339 | 1324 | 1229 | 1215 | 1215 |
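For reference, a per-step tracking table like this can be assembled as in the dada2 tutorial. The object names (`out`, `dadaFs`, `dadaRs`, `mergers`, `seqtab`) are assumptions about the workflow:

```r
# Sketch: count surviving reads at each pipeline stage.
getN <- function(x) sum(getUniques(x))
track <- cbind(out,                        # filterAndTrim output: in/out counts
               sapply(dadaFs, getN),       # denoisedF
               sapply(dadaRs, getN),       # denoisedR
               sapply(mergers, getN),      # merged
               rowSums(seqtab))            # tabled
colnames(track) <- c("readsIn", "filtered", "denoisedF",
                     "denoisedR", "merged", "tabled")
```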

Our R2 quality is not great on these runs, but it still seems odd to lose so many reads in some samples and so few in others.

If using forward reads only for these problematic samples, is it possible, within the same dada2 run, to merge the samples that can be merged and use forward reads only for the ones that cannot?

benjjneb commented 2 years ago

The failure of most reverse reads in many samples (nearly all in some) to pass denoising is a serious red flag. There is something going on that isn't obvious yet.

For now, I would not use merged data here without understanding what is going on with the reverse reads, not even for the subset of samples where reverse denoising seemed to work. The forward reads alone should be fine.

Understanding what's going on with the reverse reads may require a more detailed investigation, but I'm not sure exactly where to start, since I've never seen this exact behavior before (good denoising of the forward reads, but almost complete failure of the reverse reads, in some samples but not all). My first guess is some sort of weirdness in the library prep. Running plotComplexity on some of the failing reverse-read files could be a place to start. Second, I might just start looking at the reads themselves, for example in the experimental3 sample where none pass denoising. Does anything pop out? Do they BLAST to the right kind of stuff? Are they in the expected orientation?
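A minimal sketch of those two checks, run against the attached experimental3 reverse-read file (assumes the ShortRead Bioconductor package is installed):

```r
library(dada2)
library(ShortRead)

# Check 1: sequence complexity of the failing reverse reads.
fnR <- "experimentalsample3-MSITS3_S52_L001_R2_001.fastq.gz"
plotComplexity(fnR)

# Check 2: pull a few raw reverse reads to eyeball and BLAST by hand.
rr <- ShortRead::readFastq(fnR)
as.character(ShortRead::sread(rr)[1:5])
```

Low-complexity spikes in the plot, or reads that BLAST to unexpected targets or orientations, would each point toward a library-prep problem rather than a dada2 parameter issue.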

sarah-mangum commented 2 years ago

Thank you for your help, @benjjneb; we were not sure where to start either. Here are the results of your suggestions: R_complexity_plot_Fungalunfiltered_githubsamples.pdf. The majority of the reverse reads did hit the expected species/genus in BLAST; for the experimental3 sample, the top hits were uncultured fungus, though it also hit the expected Malassezia sp. at a lower score. Both read 1 and read 2 for these samples appear to be in the expected orientation.

sarah-mangum commented 2 years ago

Hi @benjjneb, we still don't have a concrete answer on what exactly is causing the reverse-read loss in some samples, but as best we can tell it may be due to low quality in those samples. That said, it would be nice to use the merged reads for the samples that pass, for that extra bit of information. At the same time, for the samples that fail to merge, we'd rather use the forward reads than lose the sample altogether. Something like returning rejects, as with justConcatenate, but instead of concatenating, using only the forward reads. Is this possible?

benjjneb commented 2 years ago

> That being said though it would be nice to use the merged reads for the ones that pass for that extra bit of info. At the same time, for the ones that fail to merge, we'd rather use the forward read than lose the sample altogether.

I would be worried this would introduce some unexpected bias based on the length of the underlying sequence (which can drive merging vs. non-merging). In my opinion, it's better here to just use forward reads universally.

> Something like returning rejects to justConcatenate but instead of concatenating, only use forward reads. Is this possible?

Yes, one could run mergePairs(..., returnRejects=TRUE), then replace the $sequence value for each $accept=FALSE row with the corresponding forward sequence for that pair. Not what I would recommend, though.
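A sketch of that substitution, with the caveat above that it is not recommended. `mergePairs` and `getSequences` are standard dada2 functions; the merge data frame carries `$sequence`, `$accept`, and `$forward` (an index into the denoised forward variants). Object names here are assumptions:

```r
library(dada2)

# Merge, keeping rejected pairs in the output instead of dropping them.
mm <- mergePairs(ddF, derepFs, ddR, derepRs, returnRejects = TRUE)

# For every rejected pair, fall back to the denoised forward sequence.
rej <- !mm$accept
mm$sequence[rej] <- getSequences(ddF)[mm$forward[rej]]

# Then tabulate as usual.
seqtab <- makeSequenceTable(list(sample1 = mm))
```

Note this mixes full-length merged amplicons with forward-read-length sequences in one table, which is part of the length-dependent bias warned about above.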