benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Inspecting read pairs from denoise step #324

Closed elsherbini closed 6 years ago

elsherbini commented 7 years ago

I'm working with samples and having issues at the merge step (I'm losing ~75% of my reads there). I want to look at some denoised read pairs to see if I can identify the problem. If I have the denoised objects in memory, say dadaF and dadaR (objects of class "dada"), how do I get corresponding pairs of reads from those objects?

benjjneb commented 7 years ago

Losing 75% definitely suggests a problem.

Did you try mergePairs(..., returnRejects=TRUE)? That will return the full merger data.frame, including those pairs of denoised sequence variants that failed the merge.
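A minimal sketch, assuming dadaF/dadaR and the corresponding derep objects (here called derepF/derepR) are still in memory:

mergers <- mergePairs(dadaF, derepF, dadaR, derepR, returnRejects=TRUE)  # keep rejected pairs in the output
rejects <- mergers[!mergers$accept, ]                                    # just the denoised pairs that failed to merge
head(rejects[, c("abundance", "nmatch", "nmismatch")])                   # read counts and overlap diagnostics for each failed pair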

Or are you trying to get back to the fastq reads?

elsherbini commented 7 years ago

I was hoping to get back at least the sequences of the failed mergers so I could manually inspect some of them.

Here's a sample where ~90% of reads couldn't be merged. Some samples have more like 50% of reads lost at the merge step.

This is 16S V4 with MiSeq 2x300 paired-end (I know to stay away from 300 next time!)

https://drive.google.com/drive/folders/0B8zpZcbxw7kdYkpxaldUb1dkdFE?usp=sharing

joey711 commented 7 years ago

Could be a trimming issue. MiSeq is long enough to get most (all?) of V4 in either direction, which also means there can be direction-specific weirdness. What trimming parameters did you use? Especially truncLen?

benjjneb commented 7 years ago

My guess is that this is a trimming issue from one of two places: (1) This is a low quality sequencing run with lots of Q=2 bases, and the default truncation at the first Q=2 is causing reads to fail to merge, or (2) you didn't truncate the reads before they read into the reverse primers, as the 300nt reads are longer than the ~252nt V4 region.

Could you see if changing the trimming to something like filterAndTrim(..., truncLen=c(240,200), truncQ=0) helps? Where 240/200 should be what's appropriate given the quality profile, but both <=250 to avoid read-through.
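A minimal sketch of that call, assuming the usual paired-end setup; fnFs/fnRs and filtFs/filtRs are placeholder file vectors, and 240/200 should be adjusted to your quality profiles:

library(dada2)
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(240, 200), # both <= 250 so reads don't run into the opposite primer
                     truncQ=0,             # turn off the default truncation at the first Q=2 base
                     maxEE=c(2, 2),        # an assumed, commonly used expected-errors filter (not specified above)
                     multithread=TRUE)
head(out)                                  # reads in vs. reads out per sample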

If that doesn't work, could you share the raw fastqs for one sample?

elsherbini commented 7 years ago

Truncating more aggressively helped a lot - I'm now only losing ~40% of reads at the merge step. I will try again with truncQ=0.

Thanks to both of you for offering to help troubleshoot. I guess the purpose of this issue was to see if it was easy to inspect read pairs from the denoise step (which I guess get fed into mergePairs) to help me troubleshoot myself. Is it possible to get a fastq or fasta back from the denoise object for inspection?

Edit: or better yet, a fasta/q of the denoised reads that failed to merge from the mergePairs step. It seemed like returning the failures didn't give me the sequences for the failed pairs.

benjjneb commented 6 years ago

Edit: or better yet, a fasta/q of the denoised reads that failed to merge from the mergePairs step. It seemed like returning the failures didn't give me the sequences for the failed pairs.

You can get a data.frame with the failed mergers that includes the F/R sequences as follows:

mm <- mergePairs(ddF, drpF, ddR, drpR, returnRejects=TRUE, propagateCol="sequence")  # keep rejected pairs and carry the denoised sequences along
mm <- mm[!mm$accept,]  # keep only the pairs that failed to merge

mm is now a data.frame of just the failed merges, and the mm$F.sequence and mm$R.sequence columns contain the forward and reverse sequences respectively. If you really want a fasta, you can write one out using writeFasta or uniquesToFasta.
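For example, a rough sketch of writing the failed forward sequences out with uniquesToFasta (the abundance-collapsing step and file name are just illustrative):

abund.F <- tapply(mm$abundance, mm$F.sequence, sum)       # collapse to unique forward sequences, summing read counts
unqs.F <- setNames(as.integer(abund.F), names(abund.F))   # uniques-style named integer vector
uniquesToFasta(unqs.F, fout="failed_merge_F.fasta")       # one fasta record per unique failed forward sequence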

You can also "propagate" any other column in the $clustering data.frame of a dada-class object into the data.frame output by mergePairs, which can be useful for diagnostics sometimes.
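For instance, something along these lines should work, assuming "abundance" is among the $clustering columns in your dada2 version:

mm2 <- mergePairs(ddF, drpF, ddR, drpR, returnRejects=TRUE,
                  propagateCol=c("sequence", "abundance"))  # adds F.abundance/R.abundance alongside F.sequence/R.sequence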

soluna1 commented 1 year ago

Hi, I'm also trying to get rejected pairs back.

I've used the previous idea, but it returns an error. I think it is related to list vs. data.frame format, but I don't know how to solve it.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap = 15, propagateCol="sequence", returnRejects=TRUE)

head(mergers[[1]])

The output is fine, but

rejected <- mergers[!mergers$accept,]

Error in !mergers$accept : invalid argument type

How can I get the rejected pairs?

thanks!!!

benjjneb commented 1 year ago

If you are processing multiple samples, your mergers object will be a list (number of samples long) of merger data.frames. So you will need to walk through that list to pull out the rejected pairs for each sample. In shorthand:

mm <- mergers[[1]] # merger data.frame from the first sample
mm[!mm$accept,]

And so forth for the other samples
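Or, as a small sketch covering all samples at once (names as in the snippet above):

rejected <- lapply(mergers, function(mm) mm[!mm$accept, ])  # list of failed-merge data.frames, one per sample
sapply(rejected, nrow)                                      # number of rejected pairs per sample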

mosabdel commented 3 months ago

I have soil samples that were sequenced using MiSeq for 16S amplicon analysis. However, I encountered significant issues during the merging step, resulting in the loss of nearly all reads. Attempting to use only the forward reads did not resolve the problem either.

[Attached images: denoising stats, forward-read quality profile, reverse-read quality profile]

This is my script:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path /home/mosabdel/Mim/soil-manifest.csv \
  --input-format PairedEndFastqManifestPhred33 \
  --output-path /home/mosabdel/Mim/soil-paired-end-demux.qza

Step 3: Check quality plots and sequence length

qiime tools peek soil-paired-end-demux.qza

Summarise the reads

qiime demux summarize \
  --i-data soil-paired-end-demux.qza \
  --o-visualization soil-paired-end-demux.qzv

Trim amplicon primers

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences soil-paired-end-demux.qza \
  --p-adapter-f "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG" \
  --p-front-f "CCTACGGGNGGCWGCAG" \
  --p-adapter-r "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG" \
  --p-front-r "GACTACHVGGGTATCTAATCC" \
  --o-trimmed-sequences soil-paired-demux-trimmed.qza


Visualize the output as a .qzv (the input must be identical to the output name from the previous step):

qiime demux summarize --i-data soil-paired-demux-trimmed.qza --o-visualization soil-paired-demux-trimmed.qzv


Step 4: DADA2 length trimming, denoising, chimera and PhiX removal

qiime dada2 denoise-paired \
  --p-n-threads 24 \
  --i-demultiplexed-seqs soil-paired-demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 170 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 160 \
  --o-denoising-stats dada2-denoising-stats.qza \
  --o-table dada2-table.qza \
  --o-representative-sequences dada2-rep-seqs.qza \
  --verbose

Step 4: DADA2 denoising using forward reads only

qiime dada2 denoise-single \
  --i-demultiplexed-seqs /home/mosabdel/Mim/soil-paired-demux-trimmed.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-denoising-stats /home/mosabdel/Mim/dada2-denoising-stats.qza \
  --o-table /home/mosabdel/Mim/dada2-table.qza \
  --o-representative-sequences /home/mosabdel/Mim/dada2-rep-seqs.qza \
  --p-n-threads 24 \
  --verbose

Is there any advice? Thanks

benjjneb commented 2 months ago

@mosabdel

Failure at merging is usually because the reads were truncated too short to overlap, which is required for merging. How long is your sequenced amplicon? Your forward and reverse truncation lengths must add up to at least that length, plus we recommend another 20 bp at least to account for minimum overlap requirements and biological length variation in your locus.
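As a rough back-of-the-envelope check, assuming the 341F/805R (V3-V4) primers from the cutadapt step above, which leave an amplicon of very roughly 420-430 bp after primer removal (the exact length here is an assumption):

amplicon_len <- 425               # assumed post-primer V3-V4 amplicon length (approximate)
needed <- amplicon_len + 20       # truncation lengths should sum to at least this
170 + 160 >= needed               # FALSE: 330 total cannot span the amplicon, so merging fails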

Given the clear quality drop-off in your data, using just the forward reads should be considered. Since that takes out the merging step, I don't understand how it "did not resolve the problem" -- could you clarify?

mosabdel commented 2 months ago

Hi benjjneb, thanks for your message. The length of my sequenced amplicon was 300 bp for both R and F. I used the trimmed output (--o-trimmed-sequences soil-paired-demux-trimmed.qza) for denoising with the F reads only, ignoring the R reads:

qiime dada2 denoise-single \
  --i-demultiplexed-seqs /home/mosabdel/Mim/soil-paired-demux-trimmed.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-denoising-stats /home/mosabdel/Mim/dada2-denoising-stats.qza \
  --o-table /home/mosabdel/Mim/dada2-table.qza \
  --o-representative-sequences /home/mosabdel/Mim/dada2-rep-seqs.qza \
  --p-n-threads 24 \
  --verbose

But this approach did not work, as mentioned earlier (I don't know the reason)!

Thus, I ran the analysis again with single F reads, and this relatively works and provides a number of reads for further downstream analysis, considering the low quality of the sequencing:

Step 1

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path /home/mosabdel/Mim/soil-manifest1.csv \
  --input-format SingleEndFastqManifestPhred33 \
  --output-path /home/mosabdel/Mim/soil-single-end-demux.qza

Step 2:

qiime tools peek /home/mosabdel/Mim/soil-single-end-demux.qza

Summarize the reads

qiime demux summarize \
  --i-data /home/mosabdel/Mim/soil-single-end-demux.qza \
  --o-visualization /home/mosabdel/Mim/soil-single-end-demux.qzv

Step 3:

qiime cutadapt trim-single \
  --i-demultiplexed-sequences /home/mosabdel/Mim/soil-single-end-demux.qza \
  --p-adapter "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG" \
  --p-front "CCTACGGGNGGCWGCAG" \
  --o-trimmed-sequences /home/mosabdel/Mim/soil-single-demux-trimmed.qza

Summarize the trimmed reads

qiime demux summarize \
  --i-data /home/mosabdel/Mim/soil-single-demux-trimmed.qza \
  --o-visualization /home/mosabdel/Mim/soil-single-demux-trimmed.qzv

Step 4: DADA2 denoising using single-end reads

qiime dada2 denoise-single \
  --i-demultiplexed-seqs /home/mosabdel/Mim/soil-single-demux-trimmed.qza \
  --p-trim-left 0 \
  --p-trunc-len 260 \
  --o-denoising-stats /home/mosabdel/Mim/dada2-denoising-stats.qza \
  --o-table /home/mosabdel/Mim/dada2-table.qza \
  --o-representative-sequences /home/mosabdel/Mim/dada2-rep-seqs.qza \
  --p-n-threads 24 \
  --verbose

[Attached image: denoising output]

What do you think? Thanks again for your time and help.

benjjneb commented 2 months ago

What do you think? Thanks again for your time and help.

It doesn't raise any red flags. Obviously you are losing a big chunk of your reads, but that isn't due to a fundamental issue in the data workflow; it's because the read quality is low, leading to ~half lost at filtering and some more at denoising. That kind of read loss isn't necessarily a bad thing -- keeping a lot of low-quality reads can lead to worse estimates of community composition than just looking at a more modest number of high-quality reads.

mosabdel commented 2 months ago

Thanks. Yes, I totally agree, and I am running it again and trying to reduce --p-trunc-len. I appreciate your dedication and help in the community. Best,