dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

Memory allocation vs file load time: the cost of NUMA? #1665

Open ctb opened 7 years ago

ctb commented 7 years ago

In the vein of @betatim's I/O experiments in #1554, I did some benchmarking on my laptop of hash table sizes vs. load times; results here:

https://gist.github.com/ctb/beb9a7eebf1a1282562ff0272efb4dd6

On my laptop, across table sizes from 5 MB to 2 GB, we see a nearly linear slowdown (slope m=0.7) in the sequence file => counttable loading step.

This is likely due to some nasty combination of NUMA effects and the poor cache behavior of our Bloom filter/CountMin sketch accesses.
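
To see the cache effect in isolation, here is a minimal microbenchmark (hypothetical, not khmer code; sizes and constants are illustrative) that does random single-byte increments into tables of increasing size. If the per-update cost rises with table size, cache misses, not the hash arithmetic, dominate:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const size_t n_updates = 10000000;
    std::mt19937_64 rng(42);
    for (size_t table_size = 5ULL * 1000 * 1000;
         table_size <= 2ULL * 1000 * 1000 * 1000; table_size *= 4) {
        std::vector<uint8_t> table(table_size, 0);
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < n_updates; i++) {
            // scattered updates mimic Bloom filter / CountMin probes
            table[rng() % table_size]++;
        }
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%zu bytes: %.1f ns/update\n", table_size,
                    1e9 * secs / n_updates);
    }
    return 0;
}
```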

ctb commented 7 years ago

The tl;dr is that if we do nothing but increase the memory/table size by 10x, we slow down k-mer loading by a factor of 7.

See comments https://github.com/dib-lab/khmer/pull/1553#issuecomment-266786982 and a rundown of two possible solutions in https://github.com/dib-lab/khmer/pull/1553#issuecomment-268798517. Note that speeding up insertion will still not change the speed of retrieval, which should be subject to the same badness.

The ultimate goal should probably be to use more cache-coherent data structures (see some comments in e.g. https://github.com/dib-lab/khmer/issues/1215 and https://github.com/dib-lab/khmer/issues/1198).
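
For concreteness, one commonly proposed cache-friendly alternative is a blocked Bloom filter, where all k bit probes for a key land in a single cache-line-sized block, so an add or lookup costs at most one cache miss instead of k. A minimal sketch (this is the general technique, not any design from #1215/#1198; the remix constant and block handling are illustrative, and real implementations also align blocks to cache lines):

```cpp
#include <cstdint>
#include <vector>

class BlockedBloom {
    static const size_t BLOCK_WORDS = 8;  // 8 x 64 bits = one 64-byte block
    std::vector<uint64_t> bits_;
    size_t n_blocks_;
public:
    explicit BlockedBloom(size_t n_blocks)
        : bits_(n_blocks * BLOCK_WORDS, 0), n_blocks_(n_blocks) {}

    void add(uint64_t hash, int k = 4) {
        // pick the block first, then derive all k bits inside it
        uint64_t* block = &bits_[(hash % n_blocks_) * BLOCK_WORDS];
        for (int i = 0; i < k; i++) {
            hash = hash * 0x9E3779B97F4A7C15ULL + 1;  // cheap remix
            size_t bit = hash % (BLOCK_WORDS * 64);
            block[bit / 64] |= 1ULL << (bit % 64);
        }
    }

    bool get(uint64_t hash, int k = 4) const {
        const uint64_t* block = &bits_[(hash % n_blocks_) * BLOCK_WORDS];
        for (int i = 0; i < k; i++) {
            hash = hash * 0x9E3779B97F4A7C15ULL + 1;
            size_t bit = hash % (BLOCK_WORDS * 64);
            if (!(block[bit / 64] & (1ULL << (bit % 64)))) return false;
        }
        return true;
    }
};
```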

ctb commented 7 years ago

@rob-p has some cool stuff going on with efficient alternatives to (counting) Bloom filters; I wonder if he would be interested in solving our problems for us? Should he stumble across this comment, I would direct his attention to our newly factored out Storage class in storage.hh, and point out that we only need a get_count and an add implementation to try out new data types.
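
Roughly, the shape of such a backend would be something like the following (a hypothetical sketch, not the actual Storage interface from storage.hh; only the get_count/add method names come from the comment above, everything else is an assumption):

```cpp
#include <cstdint>
#include <vector>

class ToyCountStorage {
    std::vector<uint8_t> counts_;  // saturating 8-bit counters
public:
    explicit ToyCountStorage(size_t size) : counts_(size, 0) {}

    void add(uint64_t khash) {
        uint8_t& c = counts_[khash % counts_.size()];
        if (c < 255) c++;  // saturate rather than overflow
    }

    uint8_t get_count(uint64_t khash) const {
        return counts_[khash % counts_.size()];
    }
};
```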

ctb commented 7 years ago

Ahh, I see @rob-p's cool stuff is available! https://github.com/splatlab/cqf thx @luizirber

standage commented 7 years ago

This paper popped up on bioRxiv the other day, linking to this technical report describing the CQF in detail.

betatim commented 7 years ago

Some benchmarking of my own: https://gist.github.com/betatim/24886949ded51b90a3ca066c2a7f0439

I ran the benchmark with the process pinned to a CPU (and then varied which one it was pinned to). The thinking is that by pinning our (single-threaded) process to a particular CPU, we avoid being rescheduled onto a different socket whose caches contain nothing useful for us. Pinning to a CPU is the best you can do on this front, and it doesn't seem to change anything, so I am unconvinced by the NUMA explanation (or I am missing something).
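
(The pinning can be done from the shell with `taskset -c N <cmd>`, or programmatically via the Linux affinity syscall; the following is a sketch of the syscall route, an assumption about the setup rather than code from the gist:)

```cpp
#include <sched.h>   // Linux CPU affinity API (glibc)
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    int cpu = (argc > 1) ? std::atoi(argv[1]) : 0;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 == the calling process; fails on an invalid CPU id
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... run the k-mer loading benchmark from here ...
    return 0;
}
```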

The last plot shows the runtime after subtracting the time to run the same benchmark on a file containing only 20 k-mers. This is an attempt at measuring how long it takes to allocate and zero out the memory we request. For reasonable (read: large) memory sizes, I think this takes a linear amount of time, so some of the linear slowdown is simply the time it takes to allocate the memory.
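
A quick way to check that claim directly (a standalone sketch, not from the gist) is to time allocating and zeroing buffers of doubling size; the memset forces the kernel to actually map the pages, which is where the linear cost lives:

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    for (size_t mb = 64; mb <= 2048; mb *= 2) {
        size_t n = mb * 1024 * 1024;
        auto t0 = std::chrono::steady_clock::now();
        void* p = std::malloc(n);
        if (!p) { std::fprintf(stderr, "alloc failed\n"); return 1; }
        std::memset(p, 0, n);  // fault in every page, not just reserve
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%5zu MB: %.3f s\n", mb,
                    std::chrono::duration<double>(t1 - t0).count());
        std::free(p);
    }
    return 0;
}
```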

betatim commented 7 years ago

Edited the gist to add a few more points to the "delta" plot. I'd now claim that, up to ~1.5 GB, processing the data takes a constant amount of time once you subtract the 'startup' time.

betatim commented 7 years ago

What happens if you change -N from 1 to 4?

[plot: benchmark runtimes with -N varied from 1 to 4]

Not quite sure what this tells us; it somehow measures the extra cost of computing N indices and accessing those N locations in memory.

ctb commented 7 years ago

Well, a larger N means more compute (and more memory touched) per k-mer, so this all makes straightforward sense to me...
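
Schematically (a sketch of the general multi-table scheme, not khmer's actual code), each k-mer update touches one counter in each of the N tables, so both the index computation and the number of scattered, probably-cache-missing memory accesses scale with N:

```cpp
#include <cstdint>
#include <vector>

void add_kmer(uint64_t khash, std::vector<std::vector<uint8_t>>& tables) {
    for (auto& table : tables) {                  // N tables => N probes
        uint8_t& c = table[khash % table.size()]; // each likely a cache miss
        if (c < 255) ++c;                         // saturating counter
    }
}
```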

luizirber commented 7 years ago

Another relevant point: https://fgiesen.wordpress.com/2017/04/11/memory-bandwidth/. It doesn't help us much, though: we don't have extra compute to do while waiting on memory accesses...
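
(One hedged idea, not from that post: even without other compute, memory-level parallelism can be bought by batching. Compute the table indices for a window of upcoming k-mers, software-prefetch them, then do the increments; the misses then overlap instead of serializing. A sketch using the GCC/Clang prefetch builtin, with the window size W an untested tuning assumption:)

```cpp
#include <cstdint>
#include <vector>

void add_batch(const std::vector<uint64_t>& hashes,
               std::vector<uint8_t>& table) {
    const size_t W = 16;  // prefetch window (tuning assumption)
    std::vector<size_t> idx(hashes.size());
    for (size_t i = 0; i < hashes.size(); i++) {
        idx[i] = hashes[i] % table.size();
        __builtin_prefetch(&table[idx[i]], 1);  // 1 = prefetch for write
        if (i >= W) {
            // idx[i - W]'s cache line has had W iterations to arrive
            size_t j = idx[i - W];
            if (table[j] < 255) table[j]++;
        }
    }
    // drain the last W (or fewer) pending updates
    size_t start = hashes.size() > W ? hashes.size() - W : 0;
    for (size_t i = start; i < hashes.size(); i++) {
        if (table[idx[i]] < 255) table[idx[i]]++;
    }
}
```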