benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

chimera filtering with >6 billion sequence variants... #558

Closed colinaverill closed 6 years ago

colinaverill commented 6 years ago

Hi All,

I realize this is more of a chimera problem than a dada2 problem, though the dada2 approach produces many more sequences to check for chimeras than traditional clustering approaches do. I am working with particularly large datasets: I have an SV table with 1024 rows and >6 billion unique sequences as columns. When I run dada2::removeBimeraDenovo(SV.table, method = 'consensus', multithread = TRUE), I get the following error:

 *** caught segfault ***
address 0x2b53de35fc18, cause 'memory not mapped'

Traceback:
 1: .Call("_dada2_C_table_bimera2", PACKAGE = "dada2", mat, seqs,     min_fold, min_abund, allow_one_off, min_one_off_par_dist,     match, mismatch, gap_p, max_shift)
 2: C_table_bimera2(seqtab, sqs, minFoldParentOverAbundance, minParentAbundance,     allowOneOff, minOneOffParentDistance, getDadaOpt("MATCH"),     getDadaOpt("MISMATCH"), getDadaOpt("GAP_PENALTY"), maxShift)
 3: isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose)
 4: dada2::removeBimeraDenovo(t.out, method = "consensus")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/sge/scc-c06/job_scripts/7950109: line 40: 135415 Segmentation fault

This is true even when I submit this job on a cluster node with 36 cores and >1TB of RAM. This command runs fine if I subset the dataset to the first few thousand columns. 6 billion is just too many to handle with this approach, I think.

How else can I go about chimera filtering this data set in a reproducible way? I realize I could subset the data some number of times and chimera-filter each subset; however, the end result may depend on exactly how the data was subsetted. It may be that I just need to accept this, which I am willing to do, but I thought people here might have a better idea.

benjjneb commented 6 years ago

Not gonna lie, this is outside the regimes we have considered, let alone tested, for chimera removal. I've run it up to millions of ASVs no problem, but...

The issue here is memory. The blue-sky solution is a streaming version of isBimeraDenovoTable, but we haven't implemented that. My first reaction is to try filtering out all the 1-sample ASVs first, but whether that is fully legitimate depends a bit on the overall study goals.
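A minimal sketch of that pre-filtering step, assuming seqtab is a standard sample-by-sequence count matrix from makeSequenceTable (the object name and threshold are assumptions for illustration, not from the thread):

    # Drop ASVs observed in only one sample before chimera checking.
    keep <- colSums(seqtab > 0) > 1              # number of samples each ASV appears in
    seqtab.multi <- seqtab[, keep, drop = FALSE]
    seqtab.nochim <- dada2::removeBimeraDenovo(seqtab.multi, method = "consensus", multithread = TRUE)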

Hm....

ps: How did you end up with 6 billion ASVs?

colinaverill commented 6 years ago

Good to know this is a real big data problem.

On 6 billion SVs: I was wrong, it is 6 million, not billion; my mistake. I am working with ITS and 16S data from the National Ecological Observatory Network, trying to forecast changes in soil microbiomes in space and time. This is ~30 samples per site, across many very different sites. The first time point alone is a lot of data, and this will likely continue to be true.

Given that chimeras generally form during PCR, I am going to break my data apart by site and sampling date and chimera-filter the subsets. As long as I stick with this strategy moving forward, it should be reproducible.
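A rough sketch of that per-group strategy, assuming seqtab is the full sample-by-sequence table and site.date is a hypothetical vector giving the site/date group of each sample (row); the grouping and merge steps are illustrative, not a dada2 recipe:

    # Chimera-filter one site/date group of samples at a time, then merge the results.
    filter.group <- function(samples) {
      sub <- seqtab[samples, , drop = FALSE]
      sub <- sub[, colSums(sub) > 0, drop = FALSE]   # keep only SVs seen in this group
      dada2::removeBimeraDenovo(sub, method = "consensus", multithread = TRUE)
    }
    groups <- split(rownames(seqtab), site.date)     # sample names per site/date group
    filtered <- lapply(groups, filter.group)
    seqtab.nochim <- dada2::mergeSequenceTables(tables = filtered)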

Any reason to worry too much about chimeras generated by something other than PCR? If not, then there is no need to chimera-filter more than one sample at a time: subset the columns of the SV table so that only the SVs actually observed in a given sample are checked.

benjjneb commented 6 years ago

> Any reason to worry too much about chimeras generated by something other than PCR? If not, then there is no need to chimera-filter more than one sample at a time: subset the columns of the SV table so that only the SVs actually observed in a given sample are checked.

Yep, you are right, and this is actually what removeBimeraDenovo does in its default "consensus" mode. The issue is just that the implementation loads everything into memory at once, as we weren't anticipating datasets as large as yours.

Arguably the best solution for your purposes would be to run chimera identification on each sample individually, record the results, and then do a by-hand "consensus" classification for each ASV (i.e. an ASV is a chimera if it was flagged as such in XX% of the samples in which it appeared). It will take a bit of R hacking, but it isn't impossible.
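A sketch of that per-sample + by-hand consensus idea, assuming seqtab is the sample-by-sequence count matrix; the 90% threshold and variable names are illustrative choices, not package defaults:

    library(dada2)
    flagged  <- integer(ncol(seqtab))    # times each SV was called a bimera
    observed <- integer(ncol(seqtab))    # times each SV was observed
    for (i in seq_len(nrow(seqtab))) {
      present <- which(seqtab[i, ] > 0)
      if (length(present) == 0) next
      unqs <- seqtab[i, present]
      names(unqs) <- colnames(seqtab)[present]   # sequences as names, abundances as values
      is.bim <- isBimeraDenovo(unqs)             # per-sample chimera calls
      observed[present] <- observed[present] + 1L
      flagged[present]  <- flagged[present] + as.integer(is.bim)
    }
    is.chim <- flagged / pmax(observed, 1L) >= 0.9   # chimera if flagged in >=90% of samples where seen
    seqtab.nochim <- seqtab[, !is.chim, drop = FALSE]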

I'll also think about what it would take to support a low memory "streaming" version of removeBimeraDenovo in the future for such large datasets, although that's not likely to happen soon.

colinaverill commented 6 years ago

This makes sense! I was able to get around this by filtering out singletons, which brought me from 6 million SVs down to 800,000; however, this will likely come up again in the future. If I do come up with an R hack for streaming chimera filtering wrapped around the dada2 chimera check, I will post it here.