benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

losing lots of reads with removeBimeraDenovo #887

Closed golrokh51 closed 4 years ago

golrokh51 commented 4 years ago

This is my command:

seqtab.nochimPerSimple <- removeBimeraDenovo(seqtab, method="per-sample", verbose=T)

Am I losing a lot of reads? Are these numbers of remaining reads sufficient for an analysis of V4-V5? With method="consensus" and method="pooled" I lose even more.

Also I get these warnings:

Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  restarting interrupted promise evaluation
2: In doTryCatch(return(expr), name, parentenv, handler) :
  restarting interrupted promise evaluation

Is that OK?

this is my summary table:

sample dada2_input merged nonchim final_perc_reads_retained
A1 33836 25435 13524 40.0
A2 37649 27494 14291 38.0
AB1 58756 44827 29095 49.5
AB2 45984 35930 20359 44.3
AB3 60492 36592 26591 44.0
AB3-2 48842 28053 19926 40.8
AB4 74850 57637 33113 44.2
DA1 68531 52723 32670 47.7
DA2 49329 42516 36135 73.3
DA3 36172 27139 18670 51.6
G2 31051 25440 19436 62.6
J1 35664 26561 17302 48.5
J2 36477 27803 17299 47.4
LD1J0 27635 22713 14850 53.7
LD1J14 28056 19448 10993 39.2
LD1J21 35396 23899 14153 40.0
LD1J28 35757 23803 14358 40.2
LD1J35 39375 27962 16843 42.8
LD1J7 34194 24156 13194 38.6
LD2J14 32122 22877 13226 41.2
LD2J21 31216 20889 12156 38.9
LD2J28 25830 17683 10882 42.1
LD2J35 19191 13936 9719 50.6
LD2J7 31551 24343 14803 46.9
LD2JO 31377 25988 19743 62.9
LYOP1J7 32119 20141 15755 49.1
LYOP1JO 39245 27315 23116 58.9
LYSD1 26510 21591 14888 56.2
LYSD2 46236 39189 38560 83.4
LYSD3 22109 16644 16455 74.4
LYSD4 33178 27136 26689 80.4
LYSP-P1 50403 30408 18468 36.6
LYSP-P2 24851 16008 10094 40.6
LYSP-P3 36269 23065 13890 38.3
LYSP-P4 50111 32584 22517 44.9
PCR-neg-AZEB 1754 1455 1455 83.0
SI1 74186 61981 25376 34.2
SI2 74934 57424 45930 61.3
Z03 37061 30086 23929 64.6
ZO2 32448 25324 16525 50.9
benjjneb commented 4 years ago

Yes I would say you are losing too many reads in the chimera removal step (>30%). This is almost always because primers were not removed from the reads. Did you remove primers from your raw reads? This needs to be done before running dada2, or can be performed with the trimLeft parameter of the filterAndTrim function.
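
For reference, a minimal sketch of the trimLeft approach (the file vectors fnFs/filtFs/fnRs/filtRs are placeholders, and the 19/20 nt lengths assume the standard 515F/926R V4-V5 primers):

library(dada2)
# Sketch: trim fixed-length primers off the 5' end of each read during filtering.
# This is only valid if every read starts exactly at the primer.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft=c(19, 20),  # forward (515F) and reverse (926R) primer lengths
                     maxEE=c(2, 2), multithread=TRUE)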

golrokh51 commented 4 years ago

Yes, I used cutadapt to remove them.

This is my cutadapt command:

cutadapt -a ^GTGYCAGCMGCCGCGGTAA...AAACTYAAAKRAATTGRCGG \
  -A ^CCGYCAATTYMTTTRAGTTT...TTACCGCGGCKGCTGRCAC \
  -m 200 -M 285 \
  -o ../data/trimmed/${sample}_sub_R1_trimmed.fq.gz \
  -p ../data/trimmed/${sample}_sub_R2_trimmed.fq.gz \
  --untrimmed-output ../results/_cutadapt/${sample}_untrimmed_R1.fastq \
  --untrimmed-paired-output ../results/_cutadapt/${sample}_untrimmed_R2.fastq \
  $f1 $f2 >> ../results/_cutadapt/cutadapt_primer_trimming_stats.txt 2>&1

golrokh51 commented 4 years ago

This is my cutadapt summary for one of the samples:

=== Summary ===

Total read pairs processed:         34,642
  Read 1 with adapter:              33,935 (98.0%)
  Read 2 with adapter:              34,463 (99.5%)
Pairs that were too short:              11 (0.0%)
Pairs that were too long:              792 (2.3%)
Pairs written (passing filters):    33,836 (97.7%)

Total basepairs processed: 20,821,704 bp
  Read 1: 10,424,117 bp
  Read 2: 10,397,587 bp
Total written (filtered):  18,994,721 bp (91.2%)
  Read 1:  9,513,160 bp
  Read 2:  9,481,561 bp

=== First read: Adapter 1 ===

Sequence: GTGYCAGCMGCCGCGGTAA...AAACTYAAAKRAATTGRCGG; Type: linked; Length: 19+20; 5' trimmed: 33935 times; 3' trimmed: 10004 times

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Overview of removed sequences at 5' end
length  count  expect  max.err  error counts
18      1883   0.0     1        0 1883
19      31961  0.0     1        30108 1853
20      91     0.0     1        0 91

Overview of removed sequences at 3' end
length  count  expect  max.err  error counts
3       10000  541.3   0        10000
4       1      135.3   0        1
16      2      0.0     1        2
17      1      0.0     1        1

=== Second read: Adapter 4 ===

Sequence: CCGYCAATTYMTTTRAGTTT...TTACCGCGGCKGCTGRCAC; Type: linked; Length: 20+19; 5' trimmed: 34463 times; 3' trimmed: 160 times

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1

Overview of removed sequences at 5' end
length  count  expect  max.err  error counts
18      27     0.0     1        0 0 27
19      625    0.0     1        0 549 76
20      33786  0.0     2        32334 1345 107
21      25     0.0     2        0 17 8

Overview of removed sequences at 3' end
length  count  expect  max.err  error counts
3       102    541.3   0        102
4       52     135.3   0        52
5       3      33.8    0        3
12      2      0.0     1        1 1
16      1      0.0     1        1

golrokh51 commented 4 years ago

I checked. The trimmed reads don't contain primers.
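
For reference, a quick check along the lines of the dada2 ITS workflow can count remaining primer hits in the trimmed files (a sketch; the file name is a placeholder following the cutadapt output naming above, and the primer sequences are those from the cutadapt command):

library(ShortRead); library(Biostrings)
FWD <- "GTGYCAGCMGCCGCGGTAA"; REV <- "CCGYCAATTYMTTTRAGTTT"
primerHits <- function(primer, fn) {
  # IUPAC-aware count of reads containing the primer sequence
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed=FALSE)
  sum(nhits > 0)
}
primerHits(FWD, "trimmed/A1_sub_R1_trimmed.fq.gz")  # should be ~0 after cutadapt
primerHits(REV, "trimmed/A1_sub_R2_trimmed.fq.gz")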

benjjneb commented 4 years ago

I'm not sure; in almost all cases these big drops at the chimera removal step are associated with unremoved primers.

Is there anything unusual about your library setup? For example, could there be any other technical bases such as sample barcodes or heterogeneity spacers present after the primer sequences on the reads? Can you also confirm that you are following the tutorial workflow, or are you running a more custom workflow?

golrokh51 commented 4 years ago

I am following the tutorial workflow, except that I use cutadapt for primer trimming. I will talk to our sequencing technician to see what else might be there and get back to you. Thanks!

golrokh51 commented 4 years ago

I checked with them. Nothing else should be there after the primer. Could there be any other reason causing this? The rarefaction result is horrible.

benjjneb commented 4 years ago

Can you share an example sample with me? My email is benjamin DOT j DOT callahan AT gmail DOT com

golrokh51 commented 4 years ago

Done

golrokh51 commented 4 years ago

So I talked to the technician and she showed me the PCR quality check results. There were no primer dimers. Could the function be biased because these are amplified V4-V5 regions, and the conserved region between V4 and V5 is larger?

golrokh51 commented 4 years ago

I can't do anything with a result like this:

> print(dim(seqtab))
[1]    40 38434
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T)
Identified 34426 bimeras out of 38434 input sequences.
> print(dim(seqtab.nochim))
[1]   40 4008
> print(dim(seqtab.nochimPerSample))
[1]    40 14327
> print(dim(seqtab.nochimPooled))
[1]   40 2228
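
For reference, the corresponding read fractions (rather than ASV counts) can be checked like this (a sketch, assuming the same sequence-table objects as above):

# Fraction of reads, not ASVs, surviving each chimera-removal method
sum(seqtab.nochim) / sum(seqtab)             # consensus (default)
sum(seqtab.nochimPerSample) / sum(seqtab)    # per-sample
sum(seqtab.nochimPooled) / sum(seqtab)       # pooled
rowSums(seqtab.nochim) / rowSums(seqtab)     # per-sample read retention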

benjjneb commented 4 years ago

Thanks for sending some example data. I was largely able to replicate what you are seeing (large amounts of chimeric reads being identified) with the following workflow:

library(dada2); packageVersion('dada2')
path <- "~/Desktop/someofmysamplesforissue887"; setwd(path)
fnF <- list.files(pattern="R1_001.fastq.gz")
fnR <- list.files(pattern="R2_001.fastq.gz")

plotQualityProfile(fnF) # Good throughout, maybe cut at 280
plotQualityProfile(fnR) # Pretty good, but cut at ~250 or a bit before

F515 <- "GTGYCAGCMGCCGCGGTAA"; R926 <- "CCGYCAATTYMTTTRAGTTT"
filtF <- file.path("filtered", fnF)
filtR <- file.path("filtered", fnR)
out <- filterAndTrim(fnF, filtF, fnR, filtR, trimLeft=c(nchar(F515), nchar(R926)), 
                     truncLen=c(275,245), maxEE=2)

errF <- learnErrors(filtF, multi=TRUE)
errR <- learnErrors(filtR, multi=TRUE)

ddF <- dada(filtF, err=errF, multi=TRUE)
ddR <- dada(filtR, err=errR, multi=TRUE)

mm <- mergePairs(ddF, filtF, ddR, filtR, verbose=TRUE)
sta <- makeSequenceTable(mm); dim(sta)
st <- removeBimeraDenovo(sta, verbose=TRUE)
# Identified 2811 bimeras out of 2916 input sequences.
bim <- isBimeraDenovoTable(sta)
table(bim)

FALSE  TRUE 
  105  2811

sum(st)/sum(sta)

[1] 0.6097736

The question is why. Although far from comprehensive, what I tried was to BLAST the top few identified chimeras and look for signatures of whether they are legitimate chimeras or might be mistakenly identified as chimeras:

dada2:::pfasta(head(getSequences(sta)[bim]))
# BLAST against nt

What I see is consistent with these being legitimate chimeras (with one exception). That is, they perfectly match a sequence in nt over one half of the sequence, but have lots of mismatches exclusively in the other half, as if they are a chimera of two different sequences.

This suggests to me that you may have a dataset with a very high chimera rate, perhaps due to the PCR conditions used to amplify the initial DNA. Certain choices, such as a high cycle number, short extension times, or insufficient reagents, can increase the number of chimeras in the final output.

golrokh51 commented 4 years ago

Thanks for the follow-ups. I am still not sure why I lose many fewer reads as chimeras using the per-sample method. I lose ~34000 out of ~38000 with consensus and pooled.

> print(dim(seqtab.nochimCons)) 
[1]   40 4008
> print(dim(seqtab.nochimPerSimple)) 
[1]    40 14327
> print(dim(seqtab.nochimPooled)) 
[1]   40 2228
benjjneb commented 4 years ago

I'm not sure either. From our testing, the best method in the normal workflow is "consensus" (the default). "per-sample" is not recommended because it will make different determinations of whether the same ASV is chimeric in different samples.

What I'd recommend instead is raising the minFoldParentOverAbundance parameter. There are some reports that a higher value than the default might be more appropriate, and it will be more conservative in identifying chimeras. So something like minFoldParentOverAbundance=4 or 8, but using the default "consensus" algorithm.
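
In code, that would be something like this (a sketch; 8 is just one of the suggested values):

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus",
                                    minFoldParentOverAbundance=8,
                                    multithread=TRUE, verbose=TRUE)
sum(seqtab.nochim) / sum(seqtab)  # fraction of reads retained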

golrokh51 commented 4 years ago

The bioRxiv page is not found. I will try modifying the option and get back to you. Thanks!

golrokh51 commented 4 years ago

I played with minFoldParentOverAbundance:

> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=4)
Identified 30364 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=6)
Identified 27694 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=8)
Identified 25278 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=10)
Identified 23028 bimeras out of 38434 input sequences.

It seems like it detects fewer and fewer chimeras. How high should I go? How many reads do I need, at a minimum, for a legitimate analysis?
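
One way to compare these settings by reads retained rather than bimera counts (a sketch, assuming the same seqtab as above):

# Read retention across a range of minFoldParentOverAbundance values
for (mfpoa in c(1, 2, 4, 8)) {
  st <- removeBimeraDenovo(seqtab, method="consensus",
                           minFoldParentOverAbundance=mfpoa)
  cat("minFoldParentOverAbundance =", mfpoa,
      "-> fraction of reads retained:", round(sum(st) / sum(seqtab), 3), "\n")
}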

benjjneb commented 4 years ago

How many reads do I need, at a minimum, for a legitimate analysis?

There is no threshold number of reads you need for a legitimate analysis. Ideally, you want the most accurate data possible.

It seems like it detects fewer and fewer chimeras. How high should I go?

A value like 4 or 8 is justifiable given previous reports: https://www.biorxiv.org/content/10.1101/074252v1

I probably wouldn't go higher than that.

golrokh51 commented 4 years ago

I have a question. I was looking at the 16S rRNA structure. I see that the sum of the variable regions in V3-V4 is 160 nt and the conserved region between them is 100 nt, while for V4-V5 the sum of the variable regions is 100 nt and the conserved region between them is 160 nt long. So reads should have a longer overlap in the case of V4-V5. Could that affect the result of chimera detection? It seems from chimera.cpp that chimera detection depends to some extent on the length of the overlap. Is that the source code for the R function? Could it affect dada2's decision on which reads are bimeras?

benjjneb commented 4 years ago

So reads should have a longer overlap in the case of V4-V5. Could that affect the result of chimera detection?

That won't have any impact on the chimera detection algorithm, which doesn't know about secondary structure or about the overlap between the reads (chimera removal is performed after merging the reads together).

It can have an effect on chimera generation though. The presence of a conserved region in the middle of the amplicon is conducive to more chimeras being created, as it is easier for an incomplete amplicon from an earlier PCR step to anneal to another amplicon and form a chimera in a later step when there is high sequence similarity there.

golrokh51 commented 4 years ago

Thanks Benjamin.

pandan74 commented 4 years ago

Hi all, I have a similar problem, but I have two types of samples in one run: lung and fecal. The fecal samples look just fine, but for the lung samples I lost almost all reads after chimera removal and merging: sequence_pipeline_stats.txt

Any suggestions? This is a 2x150 V4 fragment; all samples were processed and sequenced together. Demultiplexing was done with idemp.