benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
460 stars 141 forks source link

Assign Taxonomy using MIDORI sequence #1661

Closed Thyreus closed 1 month ago

Thyreus commented 1 year ago

Thank you so much for creating this pipeline. I have successfully used it for fungal and bacterial sequences. Unfortunately, when I tried to assign taxonomy for insect COI sequences from both MIDORI long and unique, I get the memory allocation failure error. I read online that this may be caused by running too many samples at a time and they suggest that I break it into smaller chunks as follows:

chunk.size <- 4000 chunks <- split(c(1:nrow(seqtab.nochim)), sort(c(1:nrow(seqtab.nochim))%%ceiling(nrow(seqtab.nochim)/chunk.size)))

chunks.species <- lapply(chunks, function(x){ return(assignTaxonomy(seqtab.nochim[x,], refFasta = "MIDORILONG.fa", verbose = TRUE)) }) taxtab.species <- do.call(rbind, chunks.species)`

However, the same problem persisted. Next, I tried only using the insect sequences from both MIDORI long and unique. This allowed the code to run. However, after running the code for 4 days straight, there was still no output. After moving the mouse a bit, the entire computer crashed. I tried chunking it, and this time it only took a few hours to crash. I tried removing everything off my computer except R studio and it also crashed. I tried removing all the sequences in MIDORI except for 5 fasta sequences and it ran, so I think MIDORI is too big perhaps? I am not really sure how to resolve the problem. UNITE ITS sequences has around the same number of FASTA sequences as MIDORI Long after filtering it to just the insect sequences.

benjjneb commented 1 year ago

Can you clarify the exact filename (of the downloaded file) of the MIDORI reference fastas you are using? I think "MIDORILONG.fa" is a rename.

Also, how much memory do you have available on your machine? And what is the output of R.version and packageVersion("dada2")?

Thyreus commented 1 year ago

Hi, sorry for the late reply. I downloaded [MIDORI2_LONGEST_NUC_GB252_CO1_DADA2.fasta.gz] and [MIDORI2_UNIQ_NUC_GB252_CO1_DADA2.fasta.gz].

I have 170 GB free on my machine. The output of R.version is 4.2.2 and the dada2 package is 1.26.0

benjjneb commented 1 year ago

When I run a test with MIDORI2_LONGEST_NUC_GB252_CO1_DADA2.fasta.gz (and same R/dada2 versions, on OSX) I was able to assign taxonomy to a sequence, but it did take up 56GB of memory. That said, it was slow: ~6 mins for 10 sequences classified, although my test machine only has 32GB of memory.

So, I think it probably would run especially given that your machine has plenty of memory, but you might want to run some timed tests on smaller batches of sequences, e.g. 100/200/400/800, and see if you can get a sense of the time scaling to estimate how long the full set would take.

To be honest the implementation of assignTaxonomy was developed with smaller reference databases as the use cases being tested against, so if this becomes rate limiting you might want to look for alternative tax assignment methods.

Thyreus commented 1 year ago

I changed the computer I was using and I got back the results but most of it only got sequenced to the Insecta_50557, none of it got identified to order, even after trying tryRC=TRUE. Looking at the quality plots, the reads of both the forward and reverse strands start to lost quality after 100bp out of 300bp. Should I try reducing or increasing the number of bp? Currently now its at 240/160 bp (forward/reverse).

benjjneb commented 1 year ago

I doubt that sequence quality is responsible for failure to assign to the order level. This is much more likely to be some kind of mismatch between the sequenced loci and the database, or maybe the database and the naive Bayesian classifier method (which is what assignTaxonomy implements).

That said, I have zero experience with COI sequencing, so I'm not sure what kind of taxonomic resolution should be expected.

Thyreus commented 1 year ago

Would it help if I use the 'addSpecies' function instead? Or how can I fix it? I would hope to get the resolution down to Family?

benjjneb commented 1 year ago

Would it help if I use the 'addSpecies' function instead?

No, that won't help here.