Closed elsherbini closed 6 years ago
Losing 75% definitely suggests a problem.
Did you try mergePairs(..., returnRejects=TRUE)? That will return the full merger data.frame, including those pairs of denoised sequence variants that failed the merge.
Or are you trying to get back to the fastq reads?
I was hoping to get back at least the sequences of failed mergers so I could manually inspect some of them.
Here's a sample where ~90% of reads couldn't be merged. Some samples have more like 50% of reads lost at the merge step.
This is 16S V4 with Miseq 300 paired end (I know to stay away from 300 next time!)
https://drive.google.com/drive/folders/0B8zpZcbxw7kdYkpxaldUb1dkdFE?usp=sharing
Could be a trimming issue. MiSeq 300 reads are long enough to cover most (all?) of V4 from either direction, which also means there can be direction-specific weirdness. What trimming parameters did you use? Especially truncLen?
My guess is that this is a trimming issue from one of two places: (1) This is a low quality sequencing run with lots of Q=2 bases, and the default truncation at the first Q=2 is causing reads to fail to merge, or (2) you didn't truncate the reads before they read into the reverse primers, as the 300nt reads are longer than the ~252nt V4 region.
Could you see if changing the trimming to something like filterAndTrim(..., truncLen=c(240,200), truncQ=0) helps? Here 240/200 should be whatever is appropriate given the quality profile, but both <=250 to avoid read-through.
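For reference, a minimal sketch of that filtering call. The file-path vectors (fnFs, filtFs, fnRs, filtRs) are placeholders, and the truncLen values should come from your own quality profiles:

```r
library(dada2)
# fnFs/fnRs: raw F/R fastq paths; filtFs/filtRs: output paths (placeholders)
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(240, 200), # both <= 250 to avoid reading through into the opposite primer
                     truncQ = 0,             # don't truncate at the first Q=2 base
                     maxEE = c(2, 2),        # a common expected-errors filter; adjust as needed
                     multithread = TRUE)
head(out)  # reads.in / reads.out per sample
```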
If that doesn't work, could you share the raw fastqs for one sample?
Truncating more aggressively helped a lot - I'm now only losing ~40% of reads at the merge step. I will try again with truncQ=0.
Thanks both of you for offering to help troubleshoot. I guess the purpose of this issue was to see if it was easy to inspect read pairs from the denoising step (which I guess get fed into mergePairs) so I could troubleshoot myself. Is it possible to get a fastq or fasta back from the denoised object for inspection?
Edit: or better yet, a fasta/q of the denoised reads that failed to merge in the mergePairs step. It seemed like returning the failures didn't give me the sequences for the failed pairs.
You can get a data.frame with the failed mergers, including the F/R sequences, as follows:

mm <- mergePairs(ddF, drpF, ddR, drpR, returnRejects=TRUE, propagateCol="sequence")
mm <- mm[!mm$accept,]

mm is now a data.frame of just the failed merges, and the mm$F.sequence and mm$R.sequence columns contain the F/R sequences respectively. If you really want a fasta, you can write one out using writeFasta or uniquesToFasta.
You can also "propagate" any other column in the $clustering data.frame of a dada-class object into the data.frame output by mergePairs, which can sometimes be useful for diagnostics.
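If a fasta of the failed pairs is what you're after, here is a base-R sketch, assuming the mm data.frame of failed merges from above; the function name and file names are placeholders:

```r
# Write the failed forward and reverse sequences to paired fasta files,
# with matching record ids so F/R entries can be lined up by number.
write_failed_fasta <- function(mm, fbase) {
  write_one <- function(seqs, path) {
    writeLines(paste0(">reject", seq_along(seqs), "\n", seqs), path)
  }
  write_one(mm$F.sequence, paste0(fbase, "_F.fasta"))
  write_one(mm$R.sequence, paste0(fbase, "_R.fasta"))
}
write_failed_fasta(mm, "failed_mergers")  # -> failed_mergers_F.fasta / failed_mergers_R.fasta
```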
Hi, I'm also trying to get rejected pairs back.
I've used the previous idea, but it returns an error; I think it is related to list vs. data.frame format, but I don't know how to solve it.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap = 15, propagateCol="sequence", returnRejects=TRUE)
head(mergers[[1]])
The output is fine, but
rejected <- mergers[!mergers$accept,]
Error in !mergers$accept : invalid argument type
How can I get the rejected pairs?
thanks!!!
If you are processing multiple samples, your mergers object will be a list (one entry per sample) of merger data.frames, so you will need to walk through that list to pull out the rejected pairs for each sample. In shorthand:

mm <- mergers[[1]] # merger data.frame from the first sample
mm[!mm$accept,]

And so forth for the other samples.
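The per-sample walk can also be done in one step, e.g. (assuming mergers is the multi-sample list from above):

```r
# One data.frame of rejected pairs per sample
rejects <- lapply(mergers, function(mm) mm[!mm$accept, ])
sapply(rejects, nrow)  # quick look: number of failed merger classes per sample
# Or combine across all samples into a single data.frame:
all_rejects <- do.call(rbind, rejects)
```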
I have soil samples that were sequenced using MiSeq for 16S amplicon analysis. However, I encountered significant issues during the merging step, resulting in the loss of nearly all reads. Attempting to use only the forward reads did not resolve the problem either.
This is my script:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path /home/mosabdel/Mim/soil-manifest.csv \
  --input-format PairedEndFastqManifestPhred33 \
  --output-path /home/mosabdel/Mim/soil-paired-end-demux.qza

qiime tools peek soil-paired-end-demux.qza

qiime demux summarize \
  --i-data soil-paired-end-demux.qza \
  --o-visualization soil-paired-end-demux.qzv

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences soil-paired-end-demux.qza \
  --p-adapter-f "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG" \
  --p-front-f "CCTACGGGNGGCWGCAG" \
  --p-adapter-r "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG" \
  --p-front-r "GACTACHVGGGTATCTAATCC" \
  --o-trimmed-sequences soil-paired-demux-trimmed.qza

qiime demux summarize \
  --i-data soil-paired-demux-trimmed.qza \
  --o-visualization soil-paired-demux-trimmed.qzv

qiime dada2 denoise-paired \
  --p-n-threads 24 \
  --i-demultiplexed-seqs soil-paired-demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 170 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 160 \
  --o-denoising-stats dada2-denoising-stats.qza \
  --o-table dada2-table.qza \
  --o-representative-sequences dada2-rep-seqs.qza \
  --verbose

qiime dada2 denoise-single \
  --i-demultiplexed-seqs /home/mosabdel/Mim/soil-paired-demux-trimmed.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-denoising-stats /home/mosabdel/Mim/dada2-denoising-stats.qza \
  --o-table /home/mosabdel/Mim/dada2-table.qza \
  --o-representative-sequences /home/mosabdel/Mim/dada2-rep-seqs.qza \
  --p-n-threads 24 \
  --verbose

Is there any advice? Thanks
@mosabdel
Failure at merging is usually because the reads were truncated too short to overlap, which is required for merging. How long is your sequenced amplicon? Your forward and reverse truncation lengths must add up to at least that length, plus we recommend another 20 bp to account for the minimum overlap requirement and biological length variation in your locus.
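As a back-of-the-envelope check (the amplicon length below is an illustrative assumption for a V3-V4 primer set like the one in the cutadapt command, not a value taken from the data):

```r
amplicon <- 465                   # assumed V3-V4 amplicon length, illustrative only
trunc_f <- 170; trunc_r <- 160    # truncation lengths from the denoise-paired call above
overlap <- trunc_f + trunc_r - amplicon
overlap   # negative here: the truncated reads cannot overlap at all, so merging must fail
```

Merging also needs the overlap to clear mergePairs's minOverlap (default 12 nt) with some margin for length variation, hence the recommended extra ~20 bp.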
Given the clear quality drop-off in your data, using just the forward reads should be considered. Since that removes the merging step entirely, I don't understand how it "did not resolve the problem" -- could you clarify?
Hi Benjineb, thanks for your message. The length of my sequenced amplicon was 300 bp for both R and F. I used the --o-trimmed-sequences soil-paired-demux-trimmed.qza output for denoising with the F reads only, ignoring the R reads:

qiime dada2 denoise-single \
  --i-demultiplexed-seqs /home/mosabdel/Mim/soil-paired-demux-trimmed.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-denoising-stats /home/mosabdel/Mim/dada2-denoising-stats.qza \
  --o-table /home/mosabdel/Mim/dada2-table.qza \
  --o-representative-sequences /home/mosabdel/Mim/dada2-rep-seqs.qza \
  --p-n-threads 24 \
  --verbose

but this approach did not work, as mentioned earlier (I don't know the reason)!! Thus, I ran the analysis again with single F reads, and this worked reasonably well, providing a number of reads for further downstream analysis considering the low quality of the sequencing.
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path /home/mosabdel/Mim/soil-manifest1.csv \
  --input-format SingleEndFastqManifestPhred33 \
  --output-path /home/mosabdel/Mim/soil-single-end-demux.qza

qiime tools peek /home/mosabdel/Mim/soil-single-end-demux.qza

qiime demux summarize \
  --i-data /home/mosabdel/Mim/soil-single-end-demux.qza \
  --o-visualization /home/mosabdel/Mim/soil-single-end-demux.qzv

qiime cutadapt trim-single \
  --i-demultiplexed-sequences /home/mosabdel/Mim/soil-single-end-demux.qza \
  --p-adapter "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG" \
  --p-front "CCTACGGGNGGCWGCAG" \
  --o-trimmed-sequences /home/mosabdel/Mim/soil-single-demux-trimmed.qza

qiime demux summarize \
  --i-data /home/mosabdel/Mim/soil-single-demux-trimmed.qza \
  --o-visualization /home/mosabdel/Mim/soil-single-demux-trimmed.qzv

qiime dada2 denoise-single \
  --i-demultiplexed-seqs /home/mosabdel/Mim/soil-single-demux-trimmed.qza \
  --p-trim-left 0 \
  --p-trunc-len 260 \
  --o-denoising-stats /home/mosabdel/Mim/dada2-denoising-stats.qza \
  --o-table /home/mosabdel/Mim/dada2-table.qza \
  --o-representative-sequences /home/mosabdel/Mim/dada2-rep-seqs.qza \
  --p-n-threads 24 \
  --verbose

What do you think? Thanks again for your time and help.
It doesn't raise any red flags. Obviously you are losing a big chunk of your reads, but that isn't due to a fundamental issue in the data workflow; it's because the read quality is low, leading to ~half lost at filtering and some more at denoising. That kind of read loss isn't necessarily a bad thing -- keeping a lot of low-quality reads can lead to worse estimates of community composition than just using a more modest number of high-quality reads.
Thanks. Yes, I totally agree; I am running it again and will try to reduce --p-trunc-len. I appreciate your dedication and help in the community. Best,
I'm working with samples and having issues at the merge step (I'm losing ~75% of my reads there). I want to look at some denoised read pairs to see if I can identify the problem. If I have the denoised objects in memory, say dadaF and dadaR (objects of class "dada"), how do I get corresponding pairs of reads from those objects?