bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0

Do you have an estimated RAM usage per sample? #23

Closed by danrlu 1 year ago

danrlu commented 1 year ago

I'm really excited to see the Rust version. Thank you!!

With the original SKA, memory was a limiting factor on how many samples we could process in one go (see here https://github.com/simonrharris/SKA/issues/26). Do you have any thoughts on the relationship between number of samples and memory usage for the new implementation?

Thanks!

johnlees commented 1 year ago

With k<=31 you can expect something like 10 bytes per base to be used. The more k-mers overlap between different samples the less RAM will be used when they are merged -- merged .skf files use 8 bytes for each k-mer plus 1 byte for each sample. If you are looking within e.g. a bacterial strain I've found you might only get 10-20% more k-mers than a single sample, but across more diverse samples (where ska works less well) it can easily be 10-fold more. Also note that the .skf files are compressed so they use less space on disk, but at the moment everything is loaded into main memory for processing. Some timings are here: https://github.com/bacpop/ska.rust/pull/19#issue-1590318456
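
As a rough illustration of the figures above, here is a minimal back-of-the-envelope sketch in Rust. It is not part of ska and does not use its API. The function names, the 2 Mb genome, the 1000 samples, and the "about one split k-mer per base" approximation are all assumptions made for the example. It also assumes that "1 byte for each sample" applies per k-mer, which is one reading of the comment.

```rust
/// Rough RAM (bytes) for a single sample with k <= 31,
/// using the ~10 bytes per base rule of thumb quoted above.
fn single_sample_bytes(genome_bases: u64) -> u64 {
    10 * genome_bases
}

/// Rough RAM (bytes) for a merged set of samples.
/// Assumption: each k-mer costs 8 bytes plus 1 byte per sample.
fn merged_bytes(n_kmers: u64, n_samples: u64) -> u64 {
    n_kmers * (8 + n_samples)
}

fn main() {
    // Hypothetical inputs, chosen only for illustration.
    let genome_bases: u64 = 2_000_000;
    let n_samples: u64 = 1000;

    // Assumption: roughly one split k-mer per base in a single sample.
    let single_kmers = genome_bases;

    // Within a strain: ~20% more k-mers than a single sample.
    let within_strain_kmers = single_kmers * 12 / 10;
    // Across diverse samples: ~10-fold more k-mers.
    let diverse_kmers = single_kmers * 10;

    let mb = |bytes: u64| bytes as f64 / 1e6;
    println!("single sample:  {:.0} MB", mb(single_sample_bytes(genome_bases)));
    println!("within strain:  {:.0} MB", mb(merged_bytes(within_strain_kmers, n_samples)));
    println!("diverse merge:  {:.0} MB", mb(merged_bytes(diverse_kmers, n_samples)));
}
```

Under these assumptions the per-sample byte dominates once there are many samples, so the merged estimate scales roughly with (number of k-mers) × (number of samples). Treat the output as an order-of-magnitude guide only.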

It would be possible to make a disk streaming version of the algorithm that uses less RAM. I had a play around with a few things here: https://github.com/bacpop/ska.rust/issues/17 I think I know how to do it, but it would take a fair amount of time so at the moment it's on my 'if many people ask for it' list.

danrlu commented 1 year ago

That's a very major improvement. It's likely sufficient for the number of strains (same species) we have 🤩

Thank you!

johnlees commented 6 months ago

@danrlu sorry I'm not sure of the best way to contact you – we are writing this up as a paper now and I'd like to include you as an author. If you're interested please shoot me an email (see my contact details here)