Yes, I would say you are losing too many reads in the chimera removal step (>30%). This is almost always because primers were not removed from the reads. Did you remove primers from your raw reads? This needs to be done before running dada2, or can be performed with the trimLeft parameter of the filterAndTrim function.
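For reference, a minimal sketch of the trimLeft route (not your exact command: fnFs/filtFs/fnRs/filtRs are placeholder file vectors, and 19/20 are the lengths of the 515F/926R primers discussed below):

library(dada2)
# Remove the primer lengths from the start of the forward and reverse reads
# while filtering; adjust trimLeft to your actual primer lengths.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft=c(19, 20),
                     truncLen=c(275, 245), maxEE=c(2, 2), multithread=TRUE)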
Yes, I used cutadapt to remove them.
This is my cutadapt command:
cutadapt -a ^GTGYCAGCMGCCGCGGTAA...AAACTYAAAKRAATTGRCGG -A ^CCGYCAATTYMTTTRAGTTT...TTACCGCGGCKGCTGRCAC -m 200 -M 285 -o ../data/trimmed/${sample}_sub_R1_trimmed.fq.gz -p ../data/trimmed/${sample}_sub_R2_trimmed.fq.gz --untrimmed-output ../results/_cutadapt/${sample}_untrimmed_R1.fastq --untrimmed-paired-output ../results/_cutadapt/${sample}_untrimmed_R2.fastq $f1 $f2 >> ../results/_cutadapt/cutadapt_primer_trimming_stats.txt 2>&1
This is my cutadapt summary for one of the samples:
=== Summary ===
Total read pairs processed:          34,642
  Read 1 with adapter:               33,935 (98.0%)
  Read 2 with adapter:               34,463 (99.5%)
Pairs that were too short:               11 (0.0%)
Pairs that were too long:               792 (2.3%)
Pairs written (passing filters):     33,836 (97.7%)
Total basepairs processed:    20,821,704 bp
  Read 1:    10,424,117 bp
  Read 2:    10,397,587 bp
Total written (filtered):     18,994,721 bp (91.2%)
  Read 1:     9,513,160 bp
  Read 2:     9,481,561 bp
=== First read: Adapter 1 ===
Sequence: GTGYCAGCMGCCGCGGTAA...AAACTYAAAKRAATTGRCGG; Type: linked; Length: 19+20; 5' trimmed: 33935 times; 3' trimmed: 10004 times
No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1
No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20 bp: 2
Overview of removed sequences at 5' end
length  count   expect  max.err  error counts
18      1883    0.0     1        0 1883
19      31961   0.0     1        30108 1853
20      91      0.0     1        0 91
Overview of removed sequences at 3' end
length  count   expect  max.err  error counts
3       10000   541.3   0        10000
4       1       135.3   0        1
16      2       0.0     1        2
17      1       0.0     1        1
=== Second read: Adapter 4 ===
Sequence: CCGYCAATTYMTTTRAGTTT...TTACCGCGGCKGCTGRCAC; Type: linked; Length: 20+19; 5' trimmed: 34463 times; 3' trimmed: 160 times
No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20 bp: 2
No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1
Overview of removed sequences at 5' end
length  count   expect  max.err  error counts
18      27      0.0     1        0 0 27
19      625     0.0     1        0 549 76
20      33786   0.0     2        32334 1345 107
21      25      0.0     2        0 17 8
Overview of removed sequences at 3' end
length  count   expect  max.err  error counts
3       102     541.3   0        102
4       52      135.3   0        52
5       3       33.8    0        3
12      2       0.0     1        1 1
16      1       0.0     1        1
I checked. The trimmed reads don't contain primers.
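(For completeness, a minimal sketch of one way to do that check in R, assuming the ShortRead and Biostrings packages are installed; primerHits is a helper defined here, not a dada2 function, and the file path is a placeholder:)

library(ShortRead); library(Biostrings)
FWD <- "GTGYCAGCMGCCGCGGTAA"; REV <- "CCGYCAATTYMTTTRAGTTT"
# Count how many reads in a fastq file still contain a (degenerate) primer sequence.
primerHits <- function(primer, fn) {
  sum(vcountPattern(primer, sread(readFastq(fn)), fixed=FALSE) > 0)
}
primerHits(FWD, "../data/trimmed/sample1_sub_R1_trimmed.fq.gz") # expect 0
primerHits(REV, "../data/trimmed/sample1_sub_R2_trimmed.fq.gz") # expect 0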
I'm not sure; in almost all cases these big drops at the chimera removal step are associated with unremoved primers.
Is there anything unusual about your library setup? For example, could there be any other technical bases, such as sample barcodes or heterogeneity spacers, present after the primer sequences on the reads? Can you also confirm that you are following the tutorial workflow, or are you running a more custom workflow?
I am following the tutorial workflow, except that I use cutadapt for primer trimming. I will talk to our sequencing technician to see what else might be there and get back to you. Thanks!
I checked with them. Nothing else should be there after the primer. Could there be any other reason causing this? The rarefaction result is horrible.
Can you share an example sample with me? My email is benjamin DOT j DOT callahan AT gmail DOT com
Done
So I talked to the technician and she showed me the PCR quality check results. There were no primer dimers. Could the function be biased because these are amplified V4-V5 regions, and the conserved region between V4 and V5 is larger?
I can't do anything with such a result:
> print(dim(seqtab))
[1]    40 38434
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T)
Identified 34426 bimeras out of 38434 input sequences.
> print(dim(seqtab.nochim))
[1]   40 4008
> print(dim(seqtab.nochimPerSample))
[1]    40 14327
> print(dim(seqtab.nochimPooled))
[1]   40 2228
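(For clarity, a sketch of how the three tables above were presumably generated from seqtab; only the method argument changes:)

seqtab.nochim          <- removeBimeraDenovo(seqtab, method="consensus", verbose=TRUE)
seqtab.nochimPerSample <- removeBimeraDenovo(seqtab, method="per-sample", verbose=TRUE)
seqtab.nochimPooled    <- removeBimeraDenovo(seqtab, method="pooled", verbose=TRUE)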
Thanks for sending some example data. I was largely able to replicate the signal you are seeing, with large amounts of chimeric reads being identified, using the following workflow:
library(dada2); packageVersion('dada2')
path <- "~/Desktop/someofmysamplesforissue887"; setwd(path)
fnF <- list.files(pattern="R1_001.fastq.gz")
fnR <- list.files(pattern="R2_001.fastq.gz")
plotQualityProfile(fnF) # Good throughout, maybe cut at 280
plotQualityProfile(fnR) # Pretty good, but cut at ~250 or a bit before
F515 <- "GTGYCAGCMGCCGCGGTAA"; R926 <- "CCGYCAATTYMTTTRAGTTT"
filtF <- file.path("filtered", fnF)
filtR <- file.path("filtered", fnR)
out <- filterAndTrim(fnF, filtF, fnR, filtR, trimLeft=c(nchar(F515), nchar(R926)),
truncLen=c(275,245), maxEE=2)
errF <- learnErrors(filtF, multi=TRUE)
errR <- learnErrors(filtR, multi=TRUE)
ddF <- dada(filtF, err=errF, multi=TRUE)
ddR <- dada(filtR, err=errR, multi=TRUE)
mm <- mergePairs(ddF, filtF, ddR, filtR, verbose=TRUE)
sta <- makeSequenceTable(mm); dim(sta)
st <- removeBimeraDenovo(sta, verbose=TRUE)
# Identified 2811 bimeras out of 2916 input sequences.
bim <- isBimeraDenovoTable(sta)
table(bim)
# FALSE  TRUE
#   105  2811
sum(st)/sum(sta)
# [1] 0.6097736
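A small follow-up sketch (assuming the sta and bim objects above) to see how much of each sample's reads the flagged ASVs account for:

# Per-sample fraction of reads assigned to ASVs flagged as bimeras
bim.frac <- rowSums(sta[, bim, drop=FALSE]) / rowSums(sta)
summary(bim.frac)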
The question is: why? Although far from comprehensive, what I tried was to BLAST the top few chimeras identified and look for signatures of whether they are legitimate chimeras or might be mistakenly identified as chimeras:
dada2:::pfasta(head(getSequences(sta)[bim]))
# BLAST against nt
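An alternative, if you prefer writing a file to BLAST, is a sketch like this (assuming the Biostrings package; the output filename is a placeholder):

chims <- head(getSequences(sta)[bim])
names(chims) <- paste0("bimera", seq_along(chims))
Biostrings::writeXStringSet(Biostrings::DNAStringSet(chims), "top_bimeras.fasta")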
What I see is consistent with these being legitimate chimeras (with one exception). That is, they perfectly match a sequence in nt over one half of the sequence, but have lots of mismatches exclusively in the other half, as if they are a chimera of two different sequences.
This suggests to me that you may have a dataset with a very high chimera rate, perhaps due to the PCR conditions used to amplify the initial DNA. Certain choices, such as a high cycle number, short extension times, or insufficient reagents, can increase the number of chimeras in the final output.
Thanks for the follow-ups. I am still not sure why I lose far fewer reads as chimeras using the per-sample method. I lose ~34,000 of ~38,000 reads with consensus and pooled.
> print(dim(seqtab.nochimCons))
[1] 40 4008
> print(dim(seqtab.nochimPerSimple))
[1] 40 14327
> print(dim(seqtab.nochimPooled))
[1] 40 2228
I'm not sure either. From our testing, the best method in the normal workflow is "consensus" (the default). "per-sample" is not recommended because it will make different determinations of whether the same ASV is chimeric in different samples.
What I'd recommend instead is considering raising the minFoldParentOverAbundance parameter; there are some reports that a higher value than the default might be more appropriate and will be more conservative in identifying chimeras. So something like minFoldParentOverAbundance=4 or 8, but using the default "consensus" algorithm.
The bioRxiv page is not found. I am trying to modify the option and will come back to you. Thanks!
I played with minFoldParentOverAbundance:
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=4)
Identified 30364 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=6)
Identified 27694 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=8)
Identified 25278 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=10)
Identified 23028 bimeras out of 38434 input sequences.
It seems like it detects fewer and fewer chimeras. How high should I go? How many reads do I need, at a minimum, for a legitimate analysis?
How many reads do I need, at a minimum, for a legitimate analysis?
There is no threshold number of reads you need for a legitimate analysis. Ideally, you want the most accurate data possible.
It seems like it detects fewer and fewer chimeras. How high should I go?
A value like 4 or 8 is justifiable given previous reports: https://www.biorxiv.org/content/10.1101/074252v1
I probably wouldn't go higher than that.
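If it helps with that decision, here is a sketch (assuming your seqtab object) that reports the fraction of reads kept, rather than just the number of flagged ASVs, since reads are what matter downstream:

for (mf in c(2, 4, 8)) {
  st.nc <- removeBimeraDenovo(seqtab, method="consensus",
                              minFoldParentOverAbundance=mf, verbose=FALSE)
  cat("minFoldParentOverAbundance =", mf,
      "-> fraction of reads kept:", round(sum(st.nc)/sum(seqtab), 3), "\n")
}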
I have a question. I was looking at the 16S rRNA structure. I see that the sum of the variable regions in V3-V4 is 160 nt and the conserved region between them is 100 nt, while for V4-V5 the sum of the variable regions is 100 nt and the conserved region between them is 160 nt long. So reads should have a longer overlap in the case of V4-V5. Could that affect the result of chimera detection? It seems from chimera.cpp that the detection of chimeras depends to some extent on the length of the overlap. Is that the source code for the R function? Could it affect dada2's decision on which reads are bimeras?
In the prevalence-based method, can you confirm that Decontam doesn't take the abundance information into account, only presence/absence? That is to say, if a given taxon is seen with 1 read in all negative controls and with 2000 reads in all samples, it will be considered a contaminant, right? If that is the case, do you recommend filtering out taxa with very few reads from each sample before running Decontam?
That won't have any impact on the chimera detection algorithm, which doesn't know about secondary structure or about the overlap between the reads (chimera removal is performed after merging the reads together).
It can have an effect on chimera generation, though. The presence of a conserved region in the middle of the amplicon is conducive to more chimeras being created, as it is easier for an incomplete amplicon from an earlier PCR step to anneal to another amplicon and form a chimera in a later step when there is high sequence similarity there.
Thanks Benjamin.
Hi all, I have a similar problem, but I have two types of samples in one run: lung and fecal. The fecal samples look just fine, but for the lung samples I lost almost all reads after merging and chimera removal: sequence_pipeline_stats.txt
Any suggestions? This is a 2x150 V4 fragment; all samples were processed and sequenced together. Demultiplexing was done with idemp.
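For reference, the tutorial-style tracking table behind stats like these is roughly the following (a sketch assuming the usual object names out, dadaFs, dadaRs, mergers, and seqtab.nochim from a multi-sample run); it shows at which step the lung samples drop out:

getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN),
               rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
head(track)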
This is my command:
seqtab.nochimPerSimple <- removeBimeraDenovo(seqtab, method="per-sample", verbose=T)
Am I losing a lot of reads? Are these numbers of remaining reads sufficient for an analysis of V4-V5? With method="consensus" and method="pooled" I lose even more. Also, I get these warnings:
Is that OK?
This is my summary table: