benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

losing lots of reads with removeBimeraDenovo #887

Closed golrokh51 closed 4 years ago

golrokh51 commented 4 years ago

This is my command:

seqtab.nochimPerSimple <- removeBimeraDenovo(seqtab, method="per-sample", verbose=T)

Am I losing a lot of reads? Are these numbers of remaining reads sufficient for an analysis of V4-V5? With method="consensus" and method="pooled" I lose even more.

Also I get these warnings:

Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  restarting interrupted promise evaluation
2: In doTryCatch(return(expr), name, parentenv, handler) :
  restarting interrupted promise evaluation

Is that OK?

this is my summary table:

sample dada2_input merged nonchim final_perc_reads_retained
A1 33836 25435 13524 40.0
A2 37649 27494 14291 38.0
AB1 58756 44827 29095 49.5
AB2 45984 35930 20359 44.3
AB3 60492 36592 26591 44.0
AB3-2 48842 28053 19926 40.8
AB4 74850 57637 33113 44.2
DA1 68531 52723 32670 47.7
DA2 49329 42516 36135 73.3
DA3 36172 27139 18670 51.6
G2 31051 25440 19436 62.6
J1 35664 26561 17302 48.5
J2 36477 27803 17299 47.4
LD1J0 27635 22713 14850 53.7
LD1J14 28056 19448 10993 39.2
LD1J21 35396 23899 14153 40.0
LD1J28 35757 23803 14358 40.2
LD1J35 39375 27962 16843 42.8
LD1J7 34194 24156 13194 38.6
LD2J14 32122 22877 13226 41.2
LD2J21 31216 20889 12156 38.9
LD2J28 25830 17683 10882 42.1
LD2J35 19191 13936 9719 50.6
LD2J7 31551 24343 14803 46.9
LD2JO 31377 25988 19743 62.9
LYOP1J7 32119 20141 15755 49.1
LYOP1JO 39245 27315 23116 58.9
LYSD1 26510 21591 14888 56.2
LYSD2 46236 39189 38560 83.4
LYSD3 22109 16644 16455 74.4
LYSD4 33178 27136 26689 80.4
LYSP-P1 50403 30408 18468 36.6
LYSP-P2 24851 16008 10094 40.6
LYSP-P3 36269 23065 13890 38.3
LYSP-P4 50111 32584 22517 44.9
PCR-neg-AZEB 1754 1455 1455 83.0
SI1 74186 61981 25376 34.2
SI2 74934 57424 45930 61.3
Z03 37061 30086 23929 64.6
ZO2 32448 25324 16525 50.9
benjjneb commented 4 years ago

Yes I would say you are losing too many reads in the chimera removal step (>30%). This is almost always because primers were not removed from the reads. Did you remove primers from your raw reads? This needs to be done before running dada2, or can be performed with the trimLeft parameter of the filterAndTrim function.
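
For reference, a minimal sketch of the trimLeft approach (the file vectors fnFs/filtFs/fnRs/filtRs are placeholders, and the 19/20 nt lengths assume the standard 515F/926R V4-V5 primers):

library(dada2)
# Sketch: trim fixed-length primers off the 5' end of each read during filtering.
# This is only valid if every read starts exactly at the primer.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft=c(19, 20),  # forward (515F) and reverse (926R) primer lengths
                     maxEE=c(2, 2), multithread=TRUE)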

golrokh51 commented 4 years ago

Yes, I used cutadapt to remove them.

This is my cutadapt command:

cutadapt -a ^GTGYCAGCMGCCGCGGTAA...AAACTYAAAKRAATTGRCGG \
  -A ^CCGYCAATTYMTTTRAGTTT...TTACCGCGGCKGCTGRCAC \
  -m 200 -M 285 \
  -o ../data/trimmed/${sample}_sub_R1_trimmed.fq.gz \
  -p ../data/trimmed/${sample}_sub_R2_trimmed.fq.gz \
  --untrimmed-output ../results/_cutadapt/${sample}_untrimmed_R1.fastq \
  --untrimmed-paired-output ../results/_cutadapt/${sample}_untrimmed_R2.fastq \
  $f1 $f2 >> ../results/_cutadapt/cutadapt_primer_trimming_stats.txt 2>&1

golrokh51 commented 4 years ago

This is my cutadapt summary for one of the samples:

=== Summary ===

Total read pairs processed:         34,642
  Read 1 with adapter:              33,935 (98.0%)
  Read 2 with adapter:              34,463 (99.5%)
Pairs that were too short:              11 (0.0%)
Pairs that were too long:              792 (2.3%)
Pairs written (passing filters):    33,836 (97.7%)

Total basepairs processed: 20,821,704 bp
  Read 1: 10,424,117 bp
  Read 2: 10,397,587 bp
Total written (filtered):  18,994,721 bp (91.2%)
  Read 1:  9,513,160 bp
  Read 2:  9,481,561 bp

=== First read: Adapter 1 ===

Sequence: GTGYCAGCMGCCGCGGTAA...AAACTYAAAKRAATTGRCGG; Type: linked; Length: 19+20; 5' trimmed: 33935 times; 3' trimmed: 10004 times

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Overview of removed sequences at 5' end
length  count  expect  max.err  error counts
18      1883   0.0     1        0 1883
19      31961  0.0     1        30108 1853
20      91     0.0     1        0 91

Overview of removed sequences at 3' end
length  count  expect  max.err  error counts
3       10000  541.3   0        10000
4       1      135.3   0        1
16      2      0.0     1        2
17      1      0.0     1        1

=== Second read: Adapter 4 ===

Sequence: CCGYCAATTYMTTTRAGTTT...TTACCGCGGCKGCTGRCAC; Type: linked; Length: 20+19; 5' trimmed: 34463 times; 3' trimmed: 160 times

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1

Overview of removed sequences at 5' end
length  count  expect  max.err  error counts
18      27     0.0     1        0 0 27
19      625    0.0     1        0 549 76
20      33786  0.0     2        32334 1345 107
21      25     0.0     2        0 17 8

Overview of removed sequences at 3' end
length  count  expect  max.err  error counts
3       102    541.3   0        102
4       52     135.3   0        52
5       3      33.8    0        3
12      2      0.0     1        1 1
16      1      0.0     1        1

golrokh51 commented 4 years ago

I checked. The trimmed reads don't contain primers.
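
For reference, a quick check along the lines of the dada2 ITS workflow can count remaining primer hits in the trimmed files (a sketch; the file name is a placeholder following the cutadapt output naming above, and the primer sequences are those from the cutadapt command):

library(ShortRead); library(Biostrings)
FWD <- "GTGYCAGCMGCCGCGGTAA"; REV <- "CCGYCAATTYMTTTRAGTTT"
primerHits <- function(primer, fn) {
  # IUPAC-aware count of reads containing the primer sequence
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed=FALSE)
  sum(nhits > 0)
}
primerHits(FWD, "trimmed/A1_sub_R1_trimmed.fq.gz")  # should be ~0 after cutadapt
primerHits(REV, "trimmed/A1_sub_R2_trimmed.fq.gz")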

benjjneb commented 4 years ago

I'm not sure; in almost all cases these big drops at the chimera removal step are associated with unremoved primers.

Is there anything unusual about your library setup? For example, could there be any other technical bases such as sample barcodes or heterogeneity spacers present after the primer sequences on the reads? Can you also confirm that you are following the tutorial workflow, or are you running a more custom workflow?

golrokh51 commented 4 years ago

I am following the tutorial workflow, except that I use cutadapt for primer trimming. I will talk to our sequencing technician to see what else might be there and get back to you. Thanks!

golrokh51 commented 4 years ago

I checked with them. Nothing else should be there after the primer. Could there be any other reason causing this? The rarefaction result is horrible.

benjjneb commented 4 years ago

Can you share an example sample with me? My email is benjamin DOT j DOT callahan AT gmail DOT com

golrokh51 commented 4 years ago

Done

golrokh51 commented 4 years ago

So I talked to the technician and she showed me the PCR quality check results. There were no primer dimers. Could the function be biased because these are amplified V4-V5 regions, and the conserved region between V4 and V5 is larger?

golrokh51 commented 4 years ago

I can't do anything with a result like this:

> print(dim(seqtab))
[1]    40 38434
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T)
Identified 34426 bimeras out of 38434 input sequences.
> print(dim(seqtab.nochim))
[1]   40 4008
> print(dim(seqtab.nochimPerSample))
[1]    40 14327
> print(dim(seqtab.nochimPooled))
[1]   40 2228
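
For reference, the corresponding read fractions (rather than ASV counts) can be checked like this (a sketch, assuming the same sequence-table objects as above):

# Fraction of reads, not ASVs, surviving each chimera-removal method
sum(seqtab.nochim) / sum(seqtab)             # consensus (default)
sum(seqtab.nochimPerSample) / sum(seqtab)    # per-sample
sum(seqtab.nochimPooled) / sum(seqtab)       # pooled
rowSums(seqtab.nochim) / rowSums(seqtab)     # per-sample read retention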

benjjneb commented 4 years ago

Thanks for sending some example data. I was largely able to replicate what you are seeing (large amounts of chimeric reads being identified) with the following workflow:

library(dada2); packageVersion('dada2')
path <- "~/Desktop/someofmysamplesforissue887"; setwd(path)
fnF <- list.files(pattern="R1_001.fastq.gz")
fnR <- list.files(pattern="R2_001.fastq.gz")

plotQualityProfile(fnF) # Good throughout, maybe cut at 280
plotQualityProfile(fnR) # Pretty good, but cut at ~250 or a bit before

F515 <- "GTGYCAGCMGCCGCGGTAA"; R926 <- "CCGYCAATTYMTTTRAGTTT"
filtF <- file.path("filtered", fnF)
filtR <- file.path("filtered", fnR)
out <- filterAndTrim(fnF, filtF, fnR, filtR, trimLeft=c(nchar(F515), nchar(R926)), 
                     truncLen=c(275,245), maxEE=2)

errF <- learnErrors(filtF, multi=TRUE)
errR <- learnErrors(filtR, multi=TRUE)

ddF <- dada(filtF, err=errF, multi=TRUE)
ddR <- dada(filtR, err=errR, multi=TRUE)

mm <- mergePairs(ddF, filtF, ddR, filtR, verbose=TRUE)
sta <- makeSequenceTable(mm); dim(sta)
st <- removeBimeraDenovo(sta, verbose=TRUE)
# Identified 2811 bimeras out of 2916 input sequences.
bim <- isBimeraDenovoTable(sta)
table(bim)

FALSE  TRUE 
  105  2811

sum(st)/sum(sta)

[1] 0.6097736

The question is why. Although far from comprehensive, what I tried was to BLAST the top few identified chimeras and look for signatures of whether they are legitimate chimeras or might be mistakenly identified as chimeras:

dada2:::pfasta(head(getSequences(sta)[bim]))
# BLAST against nt

What I see is consistent with these being legitimate chimeras (with one exception). That is, they perfectly match a sequence in nt over one half of the sequence, but have lots of mismatches exclusively in the other half, as if they are a chimera of two different sequences.

This suggests to me that you may have a dataset with a very high chimera rate, perhaps due to the PCR conditions used to amplify the initial DNA. Certain choices, such as a high cycle number, short extension times, or insufficient reagents, can increase the number of chimeras in the final output.

golrokh51 commented 4 years ago

Thanks for the follow-ups. I am still not sure why I lose many fewer reads as chimeras using the per-sample method. I lose ~34000 out of ~38000 with consensus and pooled.

> print(dim(seqtab.nochimCons)) 
[1]   40 4008
> print(dim(seqtab.nochimPerSimple)) 
[1]    40 14327
> print(dim(seqtab.nochimPooled)) 
[1]   40 2228
benjjneb commented 4 years ago

I'm not sure either. From our testing, the best method in the normal workflow is "consensus" (the default). "per-sample" is not recommended because it will make different determinations of whether the same ASV is chimeric in different samples.

What I'd recommend instead is raising the minFoldParentOverAbundance parameter. There are some reports that a higher value than the default might be more appropriate, and it will be more conservative in identifying chimeras. So something like minFoldParentOverAbundance=4 or 8, but using the default "consensus" algorithm.
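
In code, that would be something like this (a sketch; 8 is just one of the suggested values):

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus",
                                    minFoldParentOverAbundance=8,
                                    multithread=TRUE, verbose=TRUE)
sum(seqtab.nochim) / sum(seqtab)  # fraction of reads retained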

golrokh51 commented 4 years ago

The bioRxiv page is not found. I will try modifying the option and get back to you. Thanks!

golrokh51 commented 4 years ago

I played with minFoldParentOverAbundance:

> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=4)
Identified 30364 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=6)
Identified 27694 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=8)
Identified 25278 bimeras out of 38434 input sequences.
> seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=T, minFoldParentOverAbundance=10)
Identified 23028 bimeras out of 38434 input sequences.

It seems like it detects fewer and fewer chimeras. How high should I go? How many reads do I need, at a minimum, for a legitimate analysis?
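
One way to compare these settings by reads retained rather than bimera counts (a sketch, assuming the same seqtab as above):

# Read retention across a range of minFoldParentOverAbundance values
for (mfpoa in c(1, 2, 4, 8)) {
  st <- removeBimeraDenovo(seqtab, method="consensus",
                           minFoldParentOverAbundance=mfpoa)
  cat("minFoldParentOverAbundance =", mfpoa,
      "-> fraction of reads retained:", round(sum(st) / sum(seqtab), 3), "\n")
}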

benjjneb commented 4 years ago

How many reads do I need, at a minimum, for a legitimate analysis?

There is no threshold number of reads you need for a legitimate analysis. Ideally, you want the most accurate data possible.

It seems like it detects fewer and fewer chimeras. How high should I go?

A value like 4 or 8 is justifiable given previous reports: https://www.biorxiv.org/content/10.1101/074252v1

I probably wouldn't go higher than that.

golrokh51 commented 4 years ago

I have a question. I was looking at the 16S rRNA structure. I see that the sum of the variable regions in V3-V4 is 160 nt and the conserved region between them is 100 nt, while for V4-V5 the sum of the variable regions is 100 nt and the conserved region between them is 160 nt long. So reads should have a longer overlap in the case of V4-V5. Could that affect the result of chimera detection? It seems from chimera.cpp that chimera detection depends to some extent on the length of the overlap. Is that the source code for the R function? Could it affect dada2's decision on which reads are bimeras?

benjjneb commented 4 years ago

So reads should have a longer overlap in the case of V4-V5. Could that affect the result of chimera detection?

That won't have any impact on the chimera detection algorithm, which doesn't know about secondary structure or about the overlap between the reads (chimera removal is performed after merging the reads together).

It can have an effect on chimera generation though. The presence of a conserved region in the middle of the amplicon is conducive to more chimeras being created, as it is easier for an incomplete amplicon from an earlier PCR step to anneal to another amplicon and form a chimera in a later step when there is high sequence similarity there.

golrokh51 commented 4 years ago

Thanks Benjamin.

pandan74 commented 4 years ago

Hi all, I have a similar problem, but I have two types of samples in one run: lung and fecal. The fecal samples look just fine, but for the lung samples I lost almost all reads after chimera removal and merging: sequence_pipeline_stats.txt

Any suggestions? This is a 2x150 V4 fragment; all samples were processed and sequenced together. Demultiplexing was done with idemp.