Closed — johnlees closed this 12 months ago
Key-value store (database) options:
Probably faster:
Initial trial with 28 samples and ~3M k-mers
```
// 28 listeria
// 1 thread: 11 s, 2.3 GB
// 8 threads: 5 s, 3 GB
// 3228084 k-mers
// 38 MB skf file
```
Reading all merged ska array k-mers and variants in (fill), then changing one variant (modify):

| Store | Fill | Modify | Size on disk |
|---|---|---|---|
| HashMap | 284 ms | 95 ms | – |
| sled | 21113 ms | 16751 ms | 641 MB |
| sled (compressed) | 37990 ms | 21814 ms | 433 MB |
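For reference, the in-memory hash fill/modify pattern being timed above can be sketched std-only like this (the scale, 28-byte variant rows, and timing harness are illustrative assumptions, not the real ska data structures):

```rust
use std::collections::HashMap;
use std::time::Instant;

/// "fill": insert every k-mer hash with its row of per-sample variant bytes.
fn fill(n: u64, samples: usize) -> HashMap<u64, Vec<u8>> {
    let mut kmers = HashMap::with_capacity(n as usize);
    for k in 0..n {
        kmers.insert(k, vec![b'A'; samples]);
    }
    kmers
}

/// "modify": change one variant in every entry.
fn modify(kmers: &mut HashMap<u64, Vec<u8>>) {
    for v in kmers.values_mut() {
        v[0] = b'T';
    }
}

fn main() {
    let start = Instant::now();
    let mut kmers = fill(1_000_000, 28); // real test was ~3M k-mers, 28 samples
    println!("fill: {:?}", start.elapsed());

    let start = Instant::now();
    modify(&mut kmers);
    println!("modify: {:?}", start.elapsed());
}
```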
If taking this kind of approach, using a BTreeMap and reading/writing blocks over a given range (probably asynchronously) would be better. We would also want to deserialise without buffering; see https://serde.rs/stream-array.html.
Some difficulties here are that values are unlikely to be equally spaced in the B-tree, and blocked reads aren't easily supported by serde. A `Vec<u64>` read manually from disk might be a better solution.
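That manual blocked read could look roughly like this std-only sketch (the flat little-endian file layout and the `read_block` helper are assumptions for illustration):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

/// Read `count` u64 values starting at element `offset` from a flat
/// little-endian file, without loading the whole array into memory.
fn read_block(path: &str, offset: u64, count: usize) -> std::io::Result<Vec<u64>> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(offset * 8))?; // 8 bytes per u64
    let mut buf = vec![0u8; count * 8];
    f.read_exact(&mut buf)?;
    Ok(buf
        .chunks_exact(8)
        .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
        .collect())
}

fn main() -> std::io::Result<()> {
    // Write a small array to disk, then read one block back.
    let path = "kmers.bin";
    let mut f = File::create(path)?;
    for v in 0u64..100 {
        f.write_all(&v.to_le_bytes())?;
    }
    f.sync_all()?;

    let block = read_block(path, 10, 5)?;
    println!("{:?}", block); // [10, 11, 12, 13, 14]
    Ok(())
}
```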
I think the algorithm for `build_and_merge` would be something more like:
Two more things to try first:
The first, and simplest, change would be to combine the read and merge steps as they are, i.e. also do the build inside the append loop. That keeps the current parallelism approach.
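One way to sketch that combined read-and-merge step (the types, the `-` missing-base marker, and `append_and_build` are illustrative assumptions, not the real ska data structures):

```rust
use std::collections::HashMap;

/// Hypothetical per-sample input: (k-mer hash, variant base) pairs.
type SampleKmers = Vec<(u64, u8)>;

/// Merge one sample into the growing dictionary as soon as it is read,
/// instead of buffering all samples first and merging at the end.
fn append_and_build(
    merged: &mut HashMap<u64, Vec<u8>>,
    sample_idx: usize,
    n_samples: usize,
    kmers: SampleKmers,
) {
    for (hash, base) in kmers {
        let row = merged.entry(hash).or_insert_with(|| vec![b'-'; n_samples]);
        row[sample_idx] = base;
    }
}

fn main() {
    let samples = vec![
        vec![(1u64, b'A'), (2, b'C')],
        vec![(2u64, b'G'), (3, b'T')],
    ];
    let n = samples.len();
    let mut merged = HashMap::new();
    // The "append loop": read one sample, build it into the merge immediately.
    for (i, s) in samples.into_iter().enumerate() {
        append_and_build(&mut merged, i, n, s);
    }
    println!("{} merged k-mers", merged.len()); // 3
}
```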
Using a memmap odht works and is fast:
```rust
// memmap2 + odht; MyConfig implements odht::Config
let mut mmap = unsafe { MmapMut::map_mut(&file).unwrap() };
let mut mmap_hashtable =
    HashTable::<MyConfig, &mut [u8]>::init_in_place(mmap.as_mut(), max_items, load).unwrap();
```

memmap fill: 199 ms, modify: 234 ms, size = 296 MB
But:
Basically, a memory map seems to be the wrong choice when writing the whole file. If it could be changed to a B-tree of blocks it might work. Going to leave this for now.
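The "B-tree of blocks" idea could look roughly like this std-only sketch, where fixed-size blocks are keyed by their first index and only dirty blocks would need writing back (the block size, dirty-set tracking, and `BlockedStore` name are all assumptions for illustration):

```rust
use std::collections::{BTreeMap, BTreeSet};

const BLOCK: u64 = 4; // toy block size; a real store would use e.g. a page of values

/// Fixed-size blocks keyed by their first element index, so a modification
/// only marks one block dirty instead of rewriting the whole file.
struct BlockedStore {
    blocks: BTreeMap<u64, Vec<u64>>,
    dirty: BTreeSet<u64>,
}

impl BlockedStore {
    fn new() -> Self {
        BlockedStore { blocks: BTreeMap::new(), dirty: BTreeSet::new() }
    }

    fn set(&mut self, idx: u64, value: u64) {
        let key = idx - idx % BLOCK;
        let block = self.blocks.entry(key).or_insert_with(|| vec![0; BLOCK as usize]);
        block[(idx % BLOCK) as usize] = value;
        self.dirty.insert(key); // only this block needs flushing to disk
    }

    fn get(&self, idx: u64) -> Option<u64> {
        self.blocks
            .get(&(idx - idx % BLOCK))
            .map(|b| b[(idx % BLOCK) as usize])
    }
}

fn main() {
    let mut store = BlockedStore::new();
    store.set(0, 42);
    store.set(9, 7);
    // Two writes landed in two different blocks, so two blocks are dirty.
    println!("dirty blocks: {}", store.dirty.len()); // 2
    println!("{:?}", store.get(9)); // Some(7)
}
```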
https://github.com/wspeirs/btree looks like it might work here.
Look at blocked/distributed hash maps