danrlu closed this issue 1 year ago
With k<=31 you can expect something like 10 bytes per base to be used. The more k-mers overlap between different samples, the less RAM will be used when they are merged: merged .skf files use 8 bytes for each k-mer plus 1 byte for each sample. If you are looking within e.g. a bacterial strain, I've found you might only get 10-20% more k-mers than in a single sample, but across more diverse samples (where ska works less well) it can easily be 10-fold more. Also note that the .skf files are compressed so they take up less space on disk, but at the moment everything is loaded into main memory for processing. Some timings are here: https://github.com/bacpop/ska.rust/pull/19#issue-1590318456
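To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of ska: the function name and the example k-mer counts are made up for illustration, and it assumes the "8 bytes for each k-mer plus 1 byte for each sample" figure applies per k-mer in the merged set, ignoring any hash-table or allocator overhead.

```python
def estimate_merged_ram_bytes(n_kmers: int, n_samples: int) -> int:
    """Rough RAM estimate for a merged .skf held in memory.

    Assumption (from the comment above): each k-mer in the merged set
    costs 8 bytes, plus 1 byte per sample. Overheads are ignored.
    """
    if n_kmers < 0 or n_samples < 0:
        raise ValueError("counts must be non-negative")
    return n_kmers * (8 + n_samples)


if __name__ == "__main__":
    # Hypothetical numbers: a single sample with 2,000,000 k-mers, 100 samples.
    single_sample_kmers = 2_000_000
    n_samples = 100

    # Closely related samples: ~20% more k-mers than one sample.
    related = estimate_merged_ram_bytes(single_sample_kmers * 12 // 10, n_samples)
    # Diverse samples: ~10-fold more k-mers than one sample.
    diverse = estimate_merged_ram_bytes(single_sample_kmers * 10, n_samples)

    print(f"closely related: {related / 1e6:.1f} MB")
    print(f"diverse:         {diverse / 1e6:.1f} MB")
```

With these made-up inputs, the closely related case comes out at roughly a quarter of a gigabyte and the diverse case at roughly two gigabytes. Under this estimate, how many distinct k-mers the samples add matters more than the sample count alone.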
It would be possible to make a disk-streaming version of the algorithm that uses less RAM. I played around with a few things in https://github.com/bacpop/ska.rust/issues/17 and I think I know how to do it, but it would take a fair amount of time, so at the moment it's on my 'if many people ask for it' list.
That's a very major improvement. It's likely sufficient for the number of strains (same species) we have 🤩
Thank you!
I'm really excited to see the Rust version. Thank you!!
With the original SKA, memory was the limiting factor on how many samples we could process in one go (see https://github.com/simonrharris/SKA/issues/26). Do you have any thoughts on the relationship between the number of samples and memory usage in the new implementation?
Thanks!