Open ctb opened 7 years ago
The tl;dr is that if we do nothing but increase memory/table size by 10x, we slow down k-mer loading by a factor of 7.
See comments https://github.com/dib-lab/khmer/pull/1553#issuecomment-266786982 and a rundown of two possible solutions in https://github.com/dib-lab/khmer/pull/1553#issuecomment-268798517. Note that speeding up insertion will still not change the speed of retrieval, which should be subject to the same badness.
The ultimate goal should probably be to use more cache-coherent data structures (see some comments in e.g. https://github.com/dib-lab/khmer/issues/1215 and https://github.com/dib-lab/khmer/issues/1198).
@rob-p has some cool stuff going on with efficient alternatives to (counting) Bloom filters; I wonder if he would be interested in solving our problems for us? Should he stumble across this comment, I would direct his attention to our newly factored-out `Storage` class in `storage.hh`, and point out that we only need a `get_count` and an `add` implementation to try out new data types.
Ahh, I see @rob-p's cool stuff is available! https://github.com/splatlab/cqf thx @luizirber
This paper popped up on bioRxiv the other day, linking to this technical report describing the CQF in detail.
Some benchmarking of my own: https://gist.github.com/betatim/24886949ded51b90a3ca066c2a7f0439
I ran the benchmark with the process pinned to a CPU (and then varied which one it was pinned to). The thinking is that by pinning our (single-threaded) process to a particular CPU, we don't end up being scheduled onto a different socket whose caches contain nothing useful for us. Pinning the process to a CPU is the best you can do, and it doesn't seem to change anything, so I am unconvinced by the NUMA explanation (or I'm missing something).
The last plot shows the runtime when you subtract the time to run the same benchmark on a file containing only 20 k-mers. This is an attempt at measuring how long it takes to allocate and zero out the memory we request. I think for reasonable (read: large) sizes of memory this takes a linear amount of time, so some of the linear slowdown is simply the time spent allocating the memory.
Edited the gist to add a few more points for the "delta" plot. I'd now claim that up to ~1.5GB it takes a constant amount of time to process the data once you subtract the 'startup' time.
What happens if you change `-N` from 1 to 4?
Not quite sure what this tells us. Somehow it measures the extra cost of computing N indices and accessing those bits of memory.
well, more N means more compute, so this all makes straightforward sense to me...
Another point: https://fgiesen.wordpress.com/2017/04/11/memory-bandwidth/ But it doesn't help us much at all: we don't have extra compute to do while waiting for memory access...
In the vein of @betatim's I/O experiments in #1554, I did some benchmarking on my laptop with hash sizes and load times; results here:
https://gist.github.com/ctb/beb9a7eebf1a1282562ff0272efb4dd6
On my laptop, between 5 MB and 2 GB we get a nearly linear slowdown (m=0.7) in sequence file => counttable loading.
This is likely due to some nasty combination of NUMA and cache (in)coherence of our Bloom filters/CountMin sketches.