bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0
56 stars 4 forks source link

Alternatives for .skf format #17

Closed johnlees closed 12 months ago

johnlees commented 1 year ago

Look at blocked/distributed hash maps

johnlees commented 1 year ago

Key-value store (database) options:

Probably faster:

johnlees commented 1 year ago

Initial trial with 28 samples and ~3M k-mers

    // 28 listeria
    // 1 thread 11s, 2.3Gb
    // 8 threads 5s, 3Gb
    // 3228084 k-mers
    // 38Mb skf file

Reading all merge ska array k-mer and vars in (fill) and changing one variant (modify)

Hash fill: 284ms modify: 95ms sled fill: 21113ms modify: 16751ms 641Mb

With sled compression sled fill: 37990ms modify: 21814ms 433Mb

johnlees commented 1 year ago

If doing this kind of approach, using a BTreeMap and reading/writing blocks of given range (and probably async) would be better. Would also want to deserialise without buffering, see: https://serde.rs/stream-array.html.

Some difficulties here are that values are not likely to be equally spaced in the btree, and doing blocked reads with serde isn't easily supported. A Vec<u64> which is read manually from disk might be a better solution.

I think the algorithm for build_and_merge would be something more like:

  1. read a file into a hashmap
  2. merge into the dict in blocks (using seek to get correct parts of file)

Two more things to try first:

  1. Change build and merge so that files are read then merged one at a time, rather than all read, then all merged. (could then make these blocked, to give some threading)
  2. Try the memmapped hashmap to see if it's possible
johnlees commented 1 year ago

The first, and simplest, change to make would be to combine the read and merge steps as they are. i.e. inside the append loop also do the build there. That keeps the current parallelism approach

johnlees commented 1 year ago

Using a memmap odht works and is fast:

let mut mmap = unsafe { MmapMut::map_mut(&file).unwrap() };
let mut mmap_hashtable = HashTable::<MyConfig, &mut [u8]>::init_in_place(mmap.as_mut(), max_items, load).unwrap();

memmap fill: 199ms modify: 234ms size = 296M

But:

Basically memory map seems to be the wrong choice when writing the whole file. If it could be changed to be a BTree of blocks it might work. Going to leave this for now

johnlees commented 1 year ago

https://github.com/wspeirs/btree looks like it might work here