benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Faster collapseNoMismatch() #626

Open masumistadler opened 5 years ago

masumistadler commented 5 years ago

Hello Benjamin,

I've merged the sequence tables of several dada() runs (around 680 samples in total) and I am running collapseNoMismatch() right now; it is taking quite a while (it has been running for at least a day).

Now, I have noticed that this function does not have a multithread = TRUE option as most of your computationally more expensive functions do. I was wondering whether there is a way to enable multithreading. It would be great to speed up the process.

benjjneb commented 5 years ago

There currently is not, but this could be an enhancement request. We have not really run into cases where this is the rate limiting step before, so haven't made it a priority, but we can consider it in the future.

masumistadler commented 5 years ago

That would be great for the future. Thank you!

leholman commented 5 years ago

I am using a set of primers that are very 'slippy' so I generate many different lengths of merged reads for the same biological sequence. collapseNoMismatch seems like just the thing to sort this out but it takes a while!

A multithreaded version would be really useful.

masumistadler commented 5 years ago

For your information, the collapseNoMismatch() step took around 4 days for 630 16S amplicon sequencing samples with approx. 199000 ASVs.

benjjneb commented 5 years ago

Thanks for the reports. I've added this issue to the 1.14 milestone, which is our next release in ~6 months.

No promises, but it's at least on the formal agenda now!

leholman commented 5 years ago

🥺 4 days. Thanks for the info Masumi!

I'll try and benchmark the function while I run it on the various datasets and post the results up here.

leholman commented 5 years ago

I have finished running the function on some datasets. See benchmarks below.

Dataset 1: marine eDNA metabarcoding of COI, ~70 samples, 37674 OTUs - 7476 seconds
Dataset 2: marine eDNA metabarcoding of 18S, ~70 samples, 10260 OTUs - 12029 seconds
Dataset 3: marine eDNA metabarcoding of 16S, ~70 samples, 19228 OTUs - 6210 seconds

Roughly 1% of sequences in the COI dataset were merged, while only 2 sequences were merged in the 18S dataset. I pulled the code for the function and added a counter to monitor progress. The code seems to run progressively slower as it works through the dataset.

I also ran the RStudio code profiler on the function (http://rpubs.com/lholman/codeprofile1). It looks like grepl, substr, and an if statement account for most of the runtime. I run into problems like this with my own code, so I'm interested in different approaches to speeding up these functions.
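To make the shape of the problem concrete, here is a stripped-down sketch (my own illustration, not dada2's actual internals) of the kind of pairwise loop that produces this profile: each new sequence is compared against every representative kept so far, so the work grows roughly quadratically with the number of unique sequences.

# Illustrative sketch only -- not dada2's implementation.
collapse_sketch <- function(seqs) {
  reps <- character(0)
  for (s in seqs) {
    # exact-substring check in both directions, since collapseNoMismatch-style
    # collapsing allows one sequence to be contained in another
    contained <- any(grepl(s, reps, fixed = TRUE)) ||
      any(vapply(reps, function(r) grepl(r, s, fixed = TRUE), logical(1)))
    if (!contained) reps <- c(reps, s)  # reps grows, so later iterations do more work
  }
  reps
}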

ptpell77 commented 3 years ago

Hi, curious if there is now a beta version of a parallelized collapseNoMismatch command? This seems to be a bottleneck when collapsing dozens of sequencing runs.

benjjneb commented 3 years ago

Curious if there is now a beta version of a parallelized collapseNoMismatch command? This seems to be a bottleneck when collapsing dozens of sequencing runs.

Nope there is not. But more reports that this is a relevant bottleneck will help push this further up the queue, so thank you.

ARBramucci commented 3 years ago

Hi Benjamin, I agree that this step really slows down the processing of multiple plates. Along with that, I am curious: if processing, say, 60 plates, wouldn't it be better for the collapseNoMismatch step to come at the end of the process? Otherwise, combining multiple plates can reintroduce potential mismatches. For example, if sequences A and B differ only by 1 nucleotide of length at the end of the sequence, and in plate 1 A > B but in plate 2 B > A, then when combined you will once again have both A and B, even though they differ only in length with no mismatches.

masumistadler commented 3 years ago

Hi, I always run the collapseNoMismatch() step on the final sequence table that includes all plates; I do not really see the point of doing it on the individual sequence tables separately. To my knowledge, if you collapse ASVs on a per-plate basis, a collapsed ASV from plate A and a collapsed ASV from plate B will only be merged in the combined sequence table if they have exactly the same sequence. And they may differ: as the function's help page explains, the sequence kept for a collapsed ASV is that of its most abundant variant. This means that if the most abundant sequence for a collapsed ASV is the same across all plates, the ASVs will be merged when merging sequence tables, but if not, they will appear as two separate ASVs. You could run the collapse on a plate-by-plate basis, but I would suggest running another collapse on the final sequence table. Hope this answers your concern.

ARBramucci commented 3 years ago

So masumistadler, in order to do as you suggest above, do you export your ASV tables as csv files, compile all the plates together into one file, and feed it back into dada2 for the collapseNoMismatch step? And you don't run into formatting issues?

masumistadler commented 3 years ago

No, I save the seqtab files from the individual plates as .rds files. I read them in and join them with mergeSequenceTables(), a function provided by dada2 to merge several seqtabs.

Here is some example code:

library(dada2)

# read in all sequence tables from the plates (there are 24)
merger <- list() # create empty list
for (i in 1:24) {
  merger[[i]] <- readRDS(paste0("./Objects/", i, "_seqtab.rds"))
}

# merge all tables (mergeSequenceTables accepts the whole list via tables =)
st.all <- mergeSequenceTables(tables = merger)
saveRDS(st.all, "./Objects/all_seqtab.rds") # save

# collapse ASVs with identical sequences
col.st <- collapseNoMismatch(st.all, minOverlap = 20, verbose = TRUE)
saveRDS(col.st, "./Objects/collapsed_seqtab.rds")

Hope this helps!

ARBramucci commented 3 years ago

Thank you so much!!! I will try it now but that looks like it will work!!

zanieb commented 3 years ago

Hi! Just a +1 that this is something we are noticing may be a bottleneck in our pipeline. We can report actual compute times if that would be helpful too.

ARBramucci commented 3 years ago

Quick question about collapseNoMismatch and running multiple plates with dada2. I have 2 plates, and they have slightly different optimal truncLen values (optimal defined as what % of reads make it all the way through the pipeline). Should I optimise them both to use the same truncLen, or does that not matter if I am running collapseNoMismatch on them after the fact, since any different lengths will be collapsed together? Thanks heaps! Anna

benjjneb commented 3 years ago

@ARBramucci Is this paired-end data? If so, the merged reads should cover the same genetic region (i.e. spanning from primer to primer) even if the truncated forward/reverse read lengths were slightly different.

If this is single-end data, I think the best option is to use the same truncation length across the runs.
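For example, per-run filtering could look something like this (a sketch with placeholder file names; the point is only that truncLen can differ between runs for paired-end data, because mergePairs() reconstructs the full primer-to-primer amplicon either way):

# placeholder file names; each run can use its own truncLen
filterAndTrim(fwd.run1, filtF.run1, rev.run1, filtR.run1,
              truncLen = c(280, 200), multithread = TRUE)
filterAndTrim(fwd.run2, filtF.run2, rev.run2, filtR.run2,
              truncLen = c(270, 210), multithread = TRUE)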

ARBramucci commented 3 years ago

It is paired-end, 300 bp reads. So yes I agree with you that slightly different lengths should not matter. Thanks a lot for the confirmation! Anna

alexandrasimas commented 2 years ago

Hi! Is there any update on multithreading for collapseNoMismatch? And/or is there a recommended workaround? I am experiencing an issue where collapseNoMismatch is getting hung up; I'm running my dada2 workflow Rscript on an AWS instance with Ubuntu 18.04.1 (48 cores & 375 GB of mem).

I have 34 samples. After chimera removal there are 9,568 sequences remaining. They are from PacBio, so they are full-length 16S. Here are the sequence-length stats:

Min. 1st Qu. Median  Mean 3rd Qu.  Max.
1320    1413   1438  1436    1451  1598

I let the collapseNoMismatch step run for ~12 hours and it's still hung up. I've never experienced run times this long before, though I get that this is a LOT of LONG ASVs.

For a workaround, since these are PacBio HiFi reads I was considering just skipping this step. Would you agree? Or since these all should have the same ends following the universal primer removal, could I assume that any sequences that might be mergeable are within a few bps of each other? That way, I could subset the seqtab.nochim by sequence length and do collapseNoMismatch on the subsets.
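To be concrete, here is the kind of length-binning workaround I have in mind (just a sketch; the bin boundaries are illustrative, and sequences whose lengths land in different bins would never be compared, so near-boundary pairs could be missed):

library(dada2)

seqs <- getSequences(seqtab.nochim)
# illustrative 20 bp bins spanning the observed length range (1320-1598)
bins <- cut(nchar(seqs), breaks = seq(1300, 1620, by = 20), dig.lab = 4)

# collapse within each length bin, then merge the per-bin tables back together
sub.tabs <- lapply(levels(bins), function(b) {
  sub <- seqtab.nochim[, which(bins == b), drop = FALSE]
  if (ncol(sub) > 1) collapseNoMismatch(sub) else sub
})
sub.tabs <- sub.tabs[vapply(sub.tabs, ncol, integer(1)) > 0]  # drop empty bins
# the same samples repeat across the per-bin tables, so sum the repeated rows
seqtab.collapsed <- mergeSequenceTables(tables = sub.tabs, repeats = "sum")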

benjjneb commented 2 years ago

You shouldn't need to run collapseNoMismatch on normal data, including PacBio HiFi data that has had its primers consistently trimmed.

alexandrasimas commented 2 years ago

Ah okay fair point! I'll institute a rule to not run collapseNoMismatch if primer trimming has occurred! Can you confirm my thinking that if I hadn't trimmed primers, PacBio HiFi still would not need collapseNoMismatch? Thanks!

benjjneb commented 2 years ago

if I hadn't trimmed primers, PacBio HiFi still would not need collapseNoMismatch?

If you hadn't trimmed primers, you would have fundamental problems at the denoising step.

The trimming to the inter-primer sequence also alleviates any problems with library preparations that might put in variable length pads, which I don't think is common in PacBio HiFi, but is seen sometimes with Illumina.

alexandrasimas commented 2 years ago

Sure that makes sense. Thank you!!

ARBramucci commented 2 years ago

Hi, any word on speeding up the collapseNoMismatch step with a multithread option? I have been collapsing samples for 3 weeks now and I can't really keep waiting on this step... but I need those 100% identical ASVs merged. Is there an alternative way of doing this outside of dada2, if this is a rate-limiting step? Cheers, Anna

benjjneb commented 2 years ago

@ARBramucci No update other than this isn't a priority and probably won't be addressed in the near future. Note that if you have processed your data in such a way that all sequences are starting and ending at the same position (which is normally the case for amplicon sequencing data) this step is not needed.
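A quick sanity check before committing to a long collapse run is the length distribution of your sequence table (seqtab here being your merged sequence table):

library(dada2)
# distribution of ASV lengths in the merged table; a tight distribution
# (one or a few adjacent lengths) suggests collapseNoMismatch has little to do
table(nchar(getSequences(seqtab)))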

soluna1 commented 1 year ago

Hi, I've also experienced the long run times of the collapseNoMismatch() function. But, reading the comments here, I now doubt whether it is needed in my case. The data come from V4-V5 18S amplicons from 2 Illumina runs.

The code I'm using takes several seqtables (in this case just 2, which come from 2 plates; each obtained after the dada(), mergePairs() and makeSequenceTable() functions) and merges them:

library(purrr)  # for map()
library(dada2)

list.df <- map(seqtables, readRDS)
st.all <- mergeSequenceTables(tables = list.df)

Then, I apply the chimera removal step:

seqtab.raw <- removeBimeraDenovo(st.all, method="consensus",multithread=TRUE, verbose = TRUE)

The output is trimmed to keep only ASVs within a particular length range:

seqtab <- seqtab.raw[, nchar(colnames(seqtab.raw)) %in% seq(360, 410)]

and the collapse step:

final.seqtab <- collapseNoMismatch(seqtab, minOverlap = 20)

Do you consider it necessary to apply the collapseNoMismatch() function here, or is it irrelevant since the seqtables were merged beforehand?

Thanks!!!

benjjneb commented 1 year ago

@soluna1 If you used the same primer set and the same trimLeft parameters in each run, there should be no need for a collapseNoMismatch step.

That function is mostly intended for dealing with length variation that can crop up between similar but slightly different primer sets, or if different runs were trimmed differently.

dsamoht commented 3 months ago

Hi, I wrote a script that wraps CD-HIT-EST for the 100% identity clustering step. I don't have rigorous benchmarking results yet, but for ~16,000 ASVs and 56 samples I obtained results identical to collapseNoMismatch in only 30 seconds, which is orders of magnitude faster. Note that it's not thoroughly tested yet, but maybe you could give it a try. The log file will also tell you which ASVs were collapsed together, which could help compare the two methods.
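The core of the approach looks roughly like this (a minimal sketch, not the full script; file and object names are illustrative, and it assumes cd-hit-est is on your PATH):

library(dada2)

seqs <- getSequences(seqtab)
ids  <- paste0("ASV", seq_along(seqs))

# write the ASVs to FASTA
writeLines(paste0(">", ids, "\n", seqs), "asvs.fasta")

# cluster at 100% identity; -d 0 keeps full sequence names in the .clstr file
system2("cd-hit-est", c("-i", "asvs.fasta", "-o", "asvs.nr.fasta",
                        "-c", "1.0", "-n", "10", "-d", "0", "-M", "0", "-T", "0"))

# parse the .clstr file: one ">Cluster" header per cluster, member lines below,
# with each cluster's representative line ending in "*"
clstr     <- readLines("asvs.nr.fasta.clstr")
is.header <- grepl("^>Cluster", clstr)
cl        <- cumsum(is.header)[!is.header]          # cluster index of each member
member.id <- sub("^.*>(\\S+)\\.\\.\\..*$", "\\1", clstr[!is.header])
rep.id    <- member.id[grepl("\\*$", clstr[!is.header])]

# sum the counts of all members of a cluster into its representative's column
rep.seq   <- seqs[match(rep.id[cl], ids)]
counts    <- seqtab[, match(member.id, ids), drop = FALSE]
collapsed <- t(rowsum(t(counts), group = rep.seq))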

Let me know if you have any comments or suggestions!

benjjneb commented 3 months ago

Awesome, thanks @dsamoht !