Closed Ellie-BM closed 4 years ago
Hi, I have an issue with the removeBimeraDenovo function. I have 185 samples and 373744 ASVs. When I run removeBimeraDenovo, the system crashes after a couple of hours of running without any error, and the R session does not stop even after pressing the stop button on this command. I have to completely reset R.
removeBimeraDenovo(seqtab, multithread = 6)
I noticed you ask the following questions when this issue happens, so I have provided the answers in case it helps:
packageVersion("dada2")
[1] ‘1.14.0’
dim(seqtab)
[1]    185 373744
summary(nchar(getSequences(seqtab)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  291.0   443.0   449.0   453.7   462.0   493.0
I would appreciate your help.
Could you tell me more about the system you are running this on, i.e. OS, processor, and most importantly, available RAM?
My guess is that you are running into memory challenges because of how many ASVs are in your table (373744 ASVs is pretty large).
Sure! MacBook Pro; Processor: 2.4 GHz Intel Core i9; Memory: 16 GB 2400 MHz DDR4.
That is also my guess, that I am running into a memory issue. Is there any way to get around this issue?
Is there any way to get around this issue?
Do you have access to a higher memory compute environment? Otherwise, you could pre-filter your ASV table to remove e.g. sequences with very low abundance or present in only 1 sample, which probably will drop its size considerably.
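A minimal sketch of that kind of pre-filter, assuming seqtab is the table returned by makeSequenceTable (the thresholds here are placeholders to tune for your data):
# keep ASVs present in more than one sample and with a total abundance of at least 10
keep <- colSums(seqtab > 0) > 1 & colSums(seqtab) >= 10
seqtab.filt <- seqtab[, keep]
dim(seqtab.filt)  # should be much smaller in the second dimension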
I don't have access, but I believe 373744 is a huge number of ASVs; I have never seen anything like it. I think something is wrong. My denoised dadaF and dadaR objects both retain plenty of reads; for example, in one sample: input: 60113, filtered: 49190, denoisedF: 47952, denoisedR: 48818.
I wonder whether merging could cause a combinatorial explosion of ASVs.
I increased the minimum overlap to 50 bp (truncLen = c(310, 200), with a ~460 bp amplicon size), but dim(seqtab) is still unbelievably high.
I should mention that these are gut microbiome samples.
I appreciate your help or suggestions.
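For reference, the overlap requirement can also be set directly at the merge step rather than only via truncation; a minimal sketch, assuming the dadaFs/derepFs/dadaRs/derepRs objects from the standard DADA2 workflow:
library(dada2)
# require at least 50 bp of overlap when merging read pairs
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs,
                      minOverlap = 50, verbose = TRUE)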
That does seem really high, and I suspect there is something going wrong in your denoising.
Did you remove primers from your reads? If not, you can see this kind of explosion in ASVs because of the variable base positions in unremoved primers.
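One quick way to check, sketched after the primer-screening step in the DADA2 ITS workflow; the primer sequence below is a placeholder (the standard 341F V3 primer) that you would replace with the one used for your library:
library(ShortRead)
FWD <- "CCTACGGGNGGCWGCAG"  # placeholder primer; substitute your actual forward primer
hits <- vcountPattern(FWD, sread(readFastq(fnFs[1])), fixed = FALSE)  # fixed = FALSE lets IUPAC codes match
sum(hits > 0) / length(hits)  # fraction of reads still containing the primer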
I didn't check, as I thought the demultiplexed samples given to me already had the primers removed. I will check now.
Thank you very much for your advice. They didn't remove the primers, and after I trimmed the forward and reverse primers everything works :). I appreciate your help.
Hi again :). I have a problem with duplicate sequences when running DADA2. There is no error; I just get the message "Duplicate sequences detected and merged" when running makeSequenceTable. I have not seen this message before, which is why I am concerned about it.
My forward primer is 16 bases and my reverse primer is 24 bases, so I trimmed 16 bases from the start of the forward reads and 24 bases from the start of the reverse reads:
outpair <- filterAndTrim(fwd = fnFs, filt = filtFs, rev = fnRs, filt.rev = filtRs,
                         truncLen = c(315, 215), trimLeft = c(16, 24),  # trim primers off the read starts
                         maxN = 0, truncQ = 2, maxEE = c(2, 2),
                         rm.phix = TRUE, compress = TRUE, multithread = TRUE)
The size of my region is ~460 bp (V3-V4).
Q1: I didn't remove anything from the end of the sequences; am I supposed to?
Also, my table dimensions are:
dim(seqtab)
[1]   185 22717
dim(seqtab.nochim)
[1]   185  2365
which shows that I have a very high chimera rate! Is that normal?
truncLen=c(315,215)
What sequencing technology are you using that is giving paired reads >300nts long?
I just get the message "Duplicate sequences detected and merged" when running makeSequenceTable
This probably isn't a problem, but if you wanted to investigate further you could inspect any potential duplicate sequences after merging, e.g. which(duplicated(getSequences(mergers[[1]]))).
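Expanded slightly, that check might look like this (assuming mergers is the per-sample list returned by mergePairs, whose data.frames include sequence and abundance columns):
dups <- which(duplicated(getSequences(mergers[[1]])))  # positions of repeated merged sequences
mergers[[1]][dups, c("sequence", "abundance")]  # inspect the duplicates themselves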
which shows that I have a very high chimera rate! Is that normal?
It is not uncommon for many ASVs to be chimeric, but what matters more is how many reads were chimeric. What is sum(seqtab.nochim)/sum(seqtab)? That is the fraction of non-chimeric reads, and it should be higher than 70%.
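Spelled out as code, assuming the seqtab and seqtab.nochim objects from your output above:
1 - ncol(seqtab.nochim) / ncol(seqtab)  # fraction of ASVs flagged as chimeric (can be large)
1 - sum(seqtab.nochim) / sum(seqtab)    # fraction of reads that were chimeric (should be small)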
Hi, Q: What sequencing technology are you using that is giving paired reads >300 nts long? A: I was told that the final library was sequenced on an Illumina MiSeq with the v3 reagent kit (600 cycles). The sequencing was performed with a 10% PhiX spike-in.
Q: What is sum(seqtab.nochim)/sum(seqtab)? That is the fraction of non-chimeric reads, and should be higher than 70%. A: True, sorry for the confusion, I noticed my mistake last night.
I was told that the final library was sequenced on an Illumina MiSeq with the v3 reagent kit (600 cycles)
In almost all cases that gives you 300/300 nt reads, but you are enforcing truncLen=c(315,215), which can only work if the forward reads were longer than 300 nts. Is this some sort of custom setup with longer forward reads than reverse reads?
True, sorry for the confusion, I noticed my mistake last night.
So... what is the fraction of chimeric reads?
I used this code to find the read lengths of my forward and reverse reads. Maybe I am wrong; sorry for my lack of knowledge here.
library(ShortRead)  # for readFastq() and sread()
reads <- c()
for (i in seq_along(fnFs)) {
  reads <- c(reads, min(width(sread(readFastq(fnFs[i])))))  # shortest read in each file
}
reads  ## Reverse reads: 281, Forward reads: 321
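A more compact way to run the same per-file check, under the same ShortRead assumptions:
sapply(fnFs, function(f) min(width(sread(readFastq(f)))))  # minimum read length per file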
outpair <- filterAndTrim(fwd = fnFs, filt = filtFs, rev = fnRs, filt.rev = filtRs,
                         truncLen = c(315, 215),
                         trimLeft = c(25, 34),  ## forward primer + adapter = 25, reverse primer + adapter = 34
                         maxN = 0, truncQ = 2, maxEE = c(2, 2),
                         rm.phix = TRUE, compress = TRUE, multithread = TRUE)
I also attach the quality plots for the reverse (R2) and forward (R1) reads.
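For reference, quality profiles like these can be generated with dada2's built-in helper; a sketch, assuming the fnFs/fnRs filename vectors from the command above:
library(dada2)
plotQualityProfile(fnFs[1:2])  # forward (R1) reads
plotQualityProfile(fnRs[1:2])  # reverse (R2) reads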
The fraction of chimeric reads, based on this formula: 1 - sum(seqtab.nochim)/sum(seqtab), is ~10%.
I would appreciate your help figuring out where I am going wrong, or how I should correct my truncLen settings.
reads  ## Reverse reads: 281, Forward reads: 321
OK, it's just pretty rare for people to generate asymmetric read lengths like this, so I was surprised; but if that's what was done, that's OK.
The fraction of chimeric reads, based on this formula: 1 - sum(seqtab.nochim)/sum(seqtab), is ~10%.
That's fine, very much in line with what is expected. I don't see any major red flags at this point; you should be able to use this processed data.
Thanks so much! :)
Hello, I have the same problem with chimera removal in DADA2. It didn't give any results after an hour and a half, and I tried to find the primers for this experiment, but they are not available in the SRA database, so how can I fix this?
My device specs: Core i5 (8th gen), 16 GB DDR4-2400.
It didn't give any results after an hour and a half
This may just mean that it is taking longer to run. Can you look on your computer to see if the process is still running?
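If you do restart it, a sketch of a call that reports what it did when it finishes (method = "consensus" is the default in recent dada2 versions; verbose = TRUE prints a summary of how many bimeras were identified):
library(dada2)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)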
I tried to find the primers for this experiment, but they are not available in the SRA database, so how can I fix this?
You need to find out if primers are on the reads and remove them if they are.
I've cancelled it several times because I thought there might be another issue, but it takes so much time. As for the primers, I can't find them; can you help me with that? Is there a tool to discover primers, or a way to find them in the SRA database? Thanks in advance.
Note: the number of ASVs is approximately 143k.
Another question: how can I find out the amplicon size, to help with trimming and filtering?
I think you need to read the paper associated with the SRA entry, or reach out to the authors/depositors. You need to know the primers used, or you aren't going to have success analyzing the data. Many common primer sets also have known amplicon length distributions associated with them.
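If you also want to inspect the reads directly, a rough sketch for spotting a conserved prefix that may be an unremoved primer (fn is a placeholder for one of your fastq files, and 20 is an arbitrary prefix length; assumes all reads are at least 20 nt):
library(ShortRead)
reads <- sread(readFastq(fn))
head(sort(table(as.character(subseq(reads, 1, 20))), decreasing = TRUE))
# a single prefix dominating most reads suggests a primer is still attached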