dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.

Expected memory usage #80

Open matnguyen opened 1 year ago

matnguyen commented 1 year ago

What's the expected memory usage per genome for Dashing2? I'm trying to run it on 500,000 viral isolates, and am running out of memory even with 500GB of RAM.

dnbaker commented 1 year ago

May I have the full command you're using? Any additional information you could provide would be helpful as well.

In the default all-pairs mode, memory is allocated in a few key areas:

  1. Sketches. This will be around (num_entities * sketch_size * 8) bytes for the default modes. For 500,000 entities and the default sketch size, I would expect ~4GB (see the worked estimate below).

  2. K-mers, if you're saving them as well. This is again (num_entities * sketch_size * 8) bytes, but 0 otherwise.

  3. Parsing buffers. Each thread reuses a buffer when parsing files. Its size is the length of the longest sequence it has encountered so far, rounded up to the nearest power of 2. For assembled eukaryotic genomes, (3) can be big, especially when running with many threads, but I doubt it's the problem for viral assemblies.

  4. Temporary data prepared for I/O.

When writing out the distances, chunks of data are computed in parallel, and each chunk is added to a queue of results to be written to disk. It's possible that the I/O was slow enough that the program accumulated an excessive amount of distance data in that queue.
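
To put rough numbers on (1) and (2), here is a back-of-the-envelope estimate, assuming the default sketch size is 1024 registers of 8 bytes each (the value implied by the ~4GB figure above):

```sh
# Sketch memory (item 1): num_entities * sketch_size * 8 bytes
echo $((500000 * 1024 * 8))       # 4096000000 bytes, roughly 4 GB
# If k-mers are saved as well (item 2), the total roughly doubles
echo $((2 * 500000 * 1024 * 8))   # roughly 8 GB
```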

(1) and (2) can be reduced by using -o to specify an output location for a sketch database. Then the data is mmap'd instead of stored in RAM, which can reduce your memory usage.
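
A hypothetical invocation along those lines (filenames are illustrative, and the binary name depends on your build, e.g. dashing2 or dashing2_savx2):

```sh
# -o writes the sketch database to a file so it can be mmap'd rather than
# held entirely in RAM; --cmpout still controls where distances are written.
dashing2 sketch -o sketches.db --cmpout distances.txt input_sequences.fa
```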

And if you're using the top-k, threshold-filtered, or greedy clustering modes, additional memory is allocated to build LSH tables over the data for near-neighbor lookup, which can be rather significant.

Best,

Daniel

matnguyen commented 1 year ago

This is the command I'm running:

dashing2_savx2 sketch --cmpout dist_mat.txt -k 7 --parse-by-seq -p 32 sequences.multi.fa

dnbaker commented 1 year ago

Thank you! This is a big help.

There's one other place memory is used in --parse-by-seq mode: storing the sequence of each record. What's happening here is that the whole FASTA ends up being stored in memory.
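
As a rough sanity check, the on-disk size of the (uncompressed) multi-FASTA gives a ballpark for how much extra RAM that implies:

```sh
# The uncompressed FASTA is roughly the amount of sequence data that ends up
# resident in --parse-by-seq mode (filename taken from the command above).
ls -lh sequences.multi.fa
```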

I have to say, this isn't desirable behavior for most cases. If edit distance is chosen as the output distance, or the program is running in greedy clustering mode, then it needs to hold on to the sequences for later use; otherwise it doesn't.

I need to do a bit of work to reorder things and avoid this problem; I think I have a path to do it, but it will take some reorganization.

I'll update you when there's a fix for this.

Thanks again,

Daniel

dnbaker commented 1 year ago

Checking back in - this is improved with https://github.com/dnbaker/dashing2/pull/81. I'm rebuilding v2.1.18 binaries currently and will update you when they're ready.

Memory usage should be lower for --parse-by-seq mode. It won't hold onto sequences it doesn't need.

Would you give it another try?

Thanks!

Daniel