abseil / abseil-cpp

Abseil Common Libraries (C++)
https://abseil.io
Apache License 2.0

raw_hash_set: slots_ array not being prefaulted on reserve/rehash #239

Closed · StephanDollberg closed this 5 years ago

StephanDollberg commented 5 years ago

Hello,

I am seeing a performance issue in flat_hash_map and friends where we get page faults on the slots_ array on the first inserts or finds into the map.

It's the common scenario: there is an initial setup phase where you have a rough idea of how big the hashmap will be and reserve() the expected size, and then on the hot path elements are inserted and looked up. In the current implementation the first inserts/finds cause page faults, because the slots_ array is not touched in rehash (assuming the map is empty to start with).
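To make the pattern concrete, here is a minimal sketch (the size and key/value types are placeholders):

```cpp
#include <cstddef>
#include <cstdint>

#include "absl/container/flat_hash_map.h"

absl::flat_hash_map<uint64_t, uint64_t> BuildMap(std::size_t expected_size) {
  absl::flat_hash_map<uint64_t, uint64_t> map;
  // Setup phase: reserve() allocates the ctrl_/slots_ backing arrays in one
  // go, but the kernel typically maps those pages lazily, so nothing is
  // faulted in yet.
  map.reserve(expected_size);
  return map;
}

void HotPath(absl::flat_hash_map<uint64_t, uint64_t>& map, uint64_t key,
             uint64_t value) {
  // The first write to each untouched slots_ page takes a minor page fault
  // here, on the latency-sensitive path, instead of during setup.
  map.emplace(key, value);
}
```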

With a node-based implementation like std::unordered_map you can't avoid the page fault, as you are allocating on each insert anyway. However, with a flat implementation no allocations take place on insert, so we could prefault the whole backing array up front and avoid page faults on the hot path.

Doing something like https://github.com/abseil/abseil-cpp/commit/d2359c4a3ffd81e2e9a58af012028055f16427b4 fixes this problem and moves the page faults into the setup phase. I have a small reproducer benchmark here https://gist.github.com/StephanDollberg/62097375c863687490eb027b5941f87c showing a 10-20% improvement in this scenario, which roughly matches what I am seeing in a similar prod scenario.

Has something like this ever been discussed?

Thanks, Stephan

fowles commented 5 years ago

We would have to do some internal testing before we would accept a change like this, but it is worth considering.

sbenzaquen commented 5 years ago

When you say "page fault", do you actually mean page faults or Ln cache misses? That is, are you faulting because pages in your virtual address space are not backed by real memory pages, or are you suffering CPU cache misses because the relevant cache lines are missing from the cache? These are different problems and would require different solutions.

StephanDollberg commented 5 years ago

Thanks guys,

@sbenzaquen Yes, I am talking about page faults, not cache misses.

@fowles

In general, I'd be happy with an additional utility function, similar to prefetch, to prefault the slots array manually. I don't see a way of doing that with the current API. I was thinking of iterating over the map, but that only touches the control array.
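A sketch of what I mean (prefault() here is hypothetical and does not exist in the Abseil API; the loop shows why iteration doesn't help):

```cpp
#include <cstdint>

#include "absl/container/flat_hash_map.h"

void Setup() {
  absl::flat_hash_map<uint64_t, uint64_t> map;
  map.reserve(1'000'000);

  // Iterating does not prefault anything: the iterator scans the control
  // bytes and only dereferences occupied slots, so an empty-after-reserve
  // map leaves the slots_ pages untouched.
  for (const auto& kv : map) {
    (void)kv;
  }

  // map.prefault();  // hypothetical: would touch every page of slots_
}
```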

fowles commented 5 years ago

https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal/raw_hash_set.h#L1378 might be sufficient for you

fowles commented 5 years ago

rehash might touch as little as 50% of the memory if you have objects of cache-line size and are close to certain load factors. But you are probably correct that a reserve + immediate fill will hit most of the memory anyway. However, if your reserve is larger than the L3 cache size, prefaulting will flush everything else out of your caches, which will be problematic when you next try to do work.

sbenzaquen commented 5 years ago

If it is page faults you are talking about, then I am not sure that benchmark is representative: those pages will already be faulted in after the first iteration of the benchmark (assuming the memory allocator reuses the blocks that were just released in the previous iteration, which should be a good assumption).

StephanDollberg commented 5 years ago

@fowles Unfortunately, prefetch() doesn't work for me because it would require knowing all keys in advance.

@sbenzaquen That's a good point; do you think the speedup actually comes from something else? I don't think that's the case, though. If you perf trace the benchmark, you see page_faults events coming from fill_map until the very end. With the patch applied, the profile also clearly shows the page faults moving out of fill_map and into memset.

See https://gist.github.com/StephanDollberg/107244a11b5e413779a6a9fd6e96b90a (to properly zoom in on the SVG you need to right-click and open the image address; GitHub is being a bit of a pain here).

sbenzaquen commented 5 years ago

If it is really page faults, you could try something else: instead of memset-ing the whole thing, just write one byte per page. That would not flush the L1 cache as much, but should still cause the pages to be faulted in. I don't think this is something that can be easily tested in a microbenchmark, though, unless you use a different allocator.
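A minimal sketch of that idea, assuming a 4 KiB page size (sysconf(_SC_PAGESIZE) would be the portable query) and intended for freshly allocated, not-yet-initialized memory:

```cpp
#include <cstddef>

// Touch one byte per page so the kernel faults every page in, while
// dirtying at most one cache line per page (unlike a full memset).
inline void TouchOneBytePerPage(char* base, std::size_t bytes) {
  constexpr std::size_t kPageSize = 4096;  // assumed; query at runtime if needed
  for (std::size_t off = 0; off < bytes; off += kPageSize) {
    // The volatile write keeps the compiler from eliding the store.
    *static_cast<volatile char*>(base + off) = 0;
  }
}
```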

grazingiiyama commented 5 years ago

@StephanDollberg Have you tried using a custom allocator that wraps std::allocator, changing allocate to call std::allocator::allocate and then pre-fault the raw memory before returning it?
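For reference, one possible shape of such a wrapper (a sketch only; the 4 KiB page size is an assumption):

```cpp
#include <cstddef>
#include <memory>

// Sketch of the workaround: wrap std::allocator and fault every page of an
// allocation in before returning it, so the page faults happen inside
// reserve() rather than on the hot path.
template <typename T>
struct PrefaultingAllocator : std::allocator<T> {
  PrefaultingAllocator() = default;
  template <typename U>
  PrefaultingAllocator(const PrefaultingAllocator<U>&) {}

  T* allocate(std::size_t n) {
    T* p = std::allocator<T>::allocate(n);
    constexpr std::size_t kPageSize = 4096;  // assumed page size
    volatile char* bytes = reinterpret_cast<volatile char*>(p);
    for (std::size_t off = 0; off < n * sizeof(T); off += kPageSize) {
      bytes[off] = 0;  // fault the page in; memory is uninitialized anyway
    }
    return p;
  }
};
```

Passing it as the map's allocator parameter (e.g. `absl::flat_hash_map<K, V, absl::Hash<K>, std::equal_to<K>, PrefaultingAllocator<std::pair<const K, V>>>`) should move the faults into reserve().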

StephanDollberg commented 5 years ago

Well, that obviously works; I just don't think it's the best user experience.

JonathanDCohen commented 5 years ago

While that's unfortunately not a great user experience for you, we're going to set this to wontfix since there is pretty much a one-line workaround for the problem. Please feel free to re-open if you feel your point hasn't been heard or you have more to say.