alexpiper / taxreturn

An R package for creating taxonomic reference databases for metabarcoding studies
GNU General Public License v3.0

memory allocation error with clean_seqs() when calling checkForRemoteErrors() #23

Closed: morien closed this issue 3 years ago

morien commented 3 years ago

I'm trying to run the clean_seqs function on a large set of sequences (around 5 GB total of CO1 sequences, pulled from GenBank).

```r
filtered1 <- clean_seqs(uniqSeqs1, model, minscore = 100, cores = 44, shave = TRUE)
```

I get this error after the function runs for quite some time (many hours):

```
Error in checkForRemoteErrors(val) : one node produced an error:
  cannot allocate vector of size 1086.8 Gb
```

I've tried reducing the size of the dataset by dividing it into two, then four equal parts, but the issue remains. I've also tried reducing the number of cores used (down from 44 to 14, so far) to see if that would shrink the memory footprint, but it also had no effect.

Do you have any idea what might be causing this, and how to work around it?

morien commented 3 years ago

Just a brief update on this: I had some mitochondrial genomes in my sequence set (as you do in your tutorial), and removing them has resolved the error. As a test, I filtered my input FASTA down to sequences <= 10,000 bp in length.
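In case it's useful to anyone else hitting this, the filter was essentially the following (a minimal sketch assuming the sequences are read in with Biostrings; the file names are placeholders):

```r
# Minimal sketch of the length filter described above (Biostrings assumed;
# file names are placeholders). Dropping sequences over 10,000 bp removes
# the mitochondrial genomes that blow up the memory usage.
library(Biostrings)

seqs  <- readDNAStringSet("co1_sequences.fasta")
short <- seqs[width(seqs) <= 10000]
writeXStringSet(short, "co1_sequences_filtered.fasta")
```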

What is the largest sequence set you've tested taxreturn on? Is it possible that it can't handle hundreds of thousands of sequences as input if some of them are also very long? Does a memory footprint of >1 TB for fitting ~480k sequences to the model seem normal to you, if a few thousand sequences in the set are mitochondrial genomes?

alexpiper commented 3 years ago

Glad to hear you've somewhat resolved this, but that is a surprising amount of memory being used.

I recently reimplemented the clean_seqs function as map_to_model, which should be slightly (but not massively) faster and handle multithreading better, and I've updated the tutorial to use this function instead. You could give the new function a try and see if things improve.
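In your case it should be close to a drop-in replacement, something like the following (argument names assumed to carry over from your clean_seqs call above; see ?map_to_model for the full argument list):

```r
# Sketch: map_to_model used as a drop-in for the earlier clean_seqs call.
# Argument names are assumed to carry over; see ?map_to_model to confirm.
filtered1 <- map_to_model(uniqSeqs1, model, minscore = 100, cores = 44, shave = TRUE)
```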

However, I did some testing today with both clean_seqs and map_to_model, and unfortunately both functions really do slow down with lots of long sequences such as mitochondrial genomes. While there are probably some efficiency gains to be made in my R code, the most computationally intensive step is the alignment to the PHMM, for which I am already using a C++ implementation of the Viterbi algorithm from the aphid package. There is a little bit of information out there on speeding up PHMM alignment for long sequences by splitting them into chunks or doing some pre-alignment, and I will look into this further when I get a chance.
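To illustrate where the time goes, here is a self-contained toy example of that core operation, using aphid directly with ape's bundled woodmouse data (rather than any taxreturn objects). The cost of each call grows with sequence length times model length, which is why whole mitochondrial genomes hurt so much:

```r
# Toy illustration of the bottleneck: one Viterbi alignment of a sequence
# to a PHMM via the aphid package. Uses ape's bundled woodmouse data, so
# it is runnable on its own and unrelated to taxreturn's internals.
library(aphid)
library(ape)

data(woodmouse, package = "ape")        # small aligned set of DNA sequences
phmm <- derivePHMM(woodmouse)           # derive a PHMM from the alignment
vit  <- Viterbi(phmm, woodmouse[1, ])   # align one sequence to the model
vit$score                               # log-odds score (cf. the minscore filter)
```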

The largest dataset I have personally used this package with was all Insecta and Arachnida sequences in GenBank and BOLD, which came to ~3.5 million sequences. For this step I used 8 cores and 400 GB of memory and left it running over a weekend. I followed a slightly different workflow to what is currently in the package tutorial, which you can see here. Some of the main changes were to resolve taxonomic synonyms and filter out insufficiently identified taxa earlier in the workflow, to reduce the number of sequences that had to be aligned.

I'm currently in the final stages of writing my thesis, so I haven't had much time recently to work on this package. But I should be done in about a month, and will then start looking into improving speed and memory usage, writing a 'big data'-focused tutorial, and polishing the package into something a bit more user-friendly.

morien commented 3 years ago

Hey Alex, I meant to thank you earlier for all these details. Looking forward to the planned changes!

alexpiper commented 3 years ago

In the latest release I've incorporated a k-mer-based pre-screening into map_to_model, which has resulted in a ~3.8x speedup for long sequences. This works by splitting any sequence that is more than twice the length of the input PHMM into chunks, finding the chunk most similar to the reference, then using that chunk and its neighbors to seed the Viterbi alignment. Sorry this took so long to get around to!
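Roughly, the idea looks like this (a simplified sketch rather than the actual map_to_model code: it compares raw k-mer profiles against a single reference sequence standing in for the PHMM, and seed_region is a hypothetical helper):

```r
# Simplified sketch of the chunked pre-screening idea; not the actual
# map_to_model implementation. `ref` stands in for a reference sequence
# derived from the PHMM, and seed_region is a hypothetical helper.
library(Biostrings)

seed_region <- function(query, ref, chunk_size = length(ref), k = 5) {
  # Split the long query into consecutive chunks of roughly the model length
  starts <- seq(1, length(query), by = chunk_size)
  ends   <- pmin(starts + chunk_size - 1, length(query))
  chunks <- extractAt(query, IRanges(start = starts, end = ends))

  # Compare the k-mer profile of each chunk against the reference
  ref_kmers   <- oligonucleotideFrequency(ref, width = k, as.prob = TRUE)
  chunk_kmers <- oligonucleotideFrequency(chunks, width = k, as.prob = TRUE)
  dists <- apply(chunk_kmers, 1, function(p) sum(abs(p - ref_kmers)))

  # Keep the best-matching chunk plus its immediate neighbors; only this
  # short region is then passed to the expensive Viterbi alignment
  best <- which.min(dists)
  keep <- max(1, best - 1):min(length(chunks), best + 1)
  subseq(query, start = starts[keep[1]], end = ends[keep[length(keep)]])
}

# Example usage with placeholder files: a long query (e.g. a mitochondrial
# genome) is cut down to a short seed region before alignment
# query <- readDNAStringSet("long_seqs.fasta")[[1]]   # a single DNAString
# ref   <- readDNAStringSet("reference.fasta")[[1]]
# seed  <- seed_region(query, ref)
```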