StephanDollberg closed this issue 5 years ago
We would have to do some internal testing on it before we would accept a change like this, but it is worth considering
When you say "page fault", do you actually mean page faults or Ln cache misses? As in, are you faulting on the pages from the virtual address space not being in real memory pages? Or are you suffering from CPU cache misses from the cache lines missing from the cache? These problems are different and would require different solutions.
Thanks guys,
@sbenzaquen Yes, I am talking about page faults and not cache misses
@fowles
__builtin_prefetch
or something else? That would also pull the memory into cache, no? I guess an mlock followed by a munlock would work, but that can be hard to use in production as it's limited by ulimits. In general, I'd be happy with an additional utility function, similar to prefetch, to prefault the slots array manually. I currently don't see a way of doing that with the existing API. I was thinking of iterating over the map, but that only touches the control array.
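The mlock/munlock idea mentioned above could be sketched like this (a hypothetical helper, not part of any Abseil API; Linux/POSIX assumed):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Fault the range [p, p + len) into memory by locking and immediately
// unlocking it. Returns false when mlock fails, e.g. because
// RLIMIT_MEMLOCK (ulimit -l) is too low -- which is exactly the
// production concern raised above.
bool prefault_with_mlock(void* p, std::size_t len) {
  if (mlock(p, len) != 0) return false;
  munlock(p, len);  // the pages stay resident, they are just no longer pinned
  return true;
}
```

Note that even with this helper the caller would still need access to the backing array's address, which the container doesn't expose.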
http://go/gh/abseil/abseil-cpp/blob/master/absl/container/internal/raw_hash_set.h#L1378 might be sufficient for you
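For context, that kind of prefetch helper boils down to the compiler's cache-prefetch hint, which operates on cache lines rather than pages (a sketch, assuming GCC/Clang):

```cpp
// __builtin_prefetch asks the CPU to pull a cache line closer; it does
// not fault the backing page in, so by itself it does not address the
// page fault problem discussed in this issue.
void warm_cache_line(const void* p) {
  __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);
}
```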
rehash might touch as little as 50% of the memory if you have objects of cacheline size and are close to certain load factors. But you are probably correct that a reserve followed immediately by filling will hit most of the memory anyway. However, if your reserve is larger than the L3 cache, you will flush everything else out of your caches, which will be a problem the next time you try to do real work.
If it is page faults you are talking about, then I am not sure that benchmark is representative: those pages will already be faulted in after the first iteration of the benchmark (assuming the memory allocator reuses the blocks that were just released in the previous iteration, which should be a safe assumption).
@fowles Unfortunately, prefetch() doesn't work for me because it would require knowing all keys in advance.
@sbenzaquen That's a good point, so you think the speedup actually comes from something else? However, I don't think that's the case. If you perf trace the benchmark, you see page_faults coming from fill_map until the very end. You can also see the page faults very clearly in a profile; with the patch applied they move out of fill_map and into memset.
See https://gist.github.com/StephanDollberg/107244a11b5e413779a6a9fd6e96b90a (to zoom in and use the SVG properly you need to right-click and copy the image address; GitHub is being a bit of a pain here).
If it really is page faults, you could try something else: instead of memsetting the whole thing, just write one byte per page. That would not flush the L1 cache as much but should still cause the pages to be faulted in. I don't think this is something that can be easily tested in a microbenchmark, though, unless you use a different allocator.
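The one-byte-per-page idea could look roughly like this (a sketch; the 4096-byte page size is an assumption, real code should query sysconf(_SC_PAGESIZE)):

```cpp
#include <cstddef>

// One write per page is enough to make the kernel fault the page in,
// without streaming the whole allocation through the cache the way a
// full memset would.
void touch_pages(unsigned char* p, std::size_t len, std::size_t page = 4096) {
  for (std::size_t off = 0; off < len; off += page) {
    // A write (not just a read) is needed so anonymous pages don't stay
    // mapped to the shared zero page.
    p[off] = 0;
  }
}
```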
@StephanDollberg Have you tried using a custom allocator which wraps std::allocator and touches the memory it allocates?
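A sketch of such a wrapper (names are illustrative, not an Abseil API; it zeroes freshly allocated memory, which faults all of its pages in during reserve() rather than on the hot path):

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

template <class T>
struct PrefaultingAllocator : std::allocator<T> {
  PrefaultingAllocator() = default;
  template <class U>
  PrefaultingAllocator(const PrefaultingAllocator<U>&) {}

  // Rebind to the wrapper, not to plain std::allocator<U>.
  template <class U>
  struct rebind { using other = PrefaultingAllocator<U>; };

  T* allocate(std::size_t n) {
    T* p = std::allocator<T>::allocate(n);
    // Touch every page up front so later inserts don't fault.
    std::memset(static_cast<void*>(p), 0, n * sizeof(T));
    return p;
  }
};
```

It could then be plugged in via the container's allocator parameter, e.g. `absl::flat_hash_map<K, V, Hash, Eq, PrefaultingAllocator<std::pair<const K, V>>>`.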
Well, that obviously works. I just don't think it's the best user experience.
While that's unfortunately not a great user experience for you, we're going to set this to wontfix since it's pretty much a one-line workaround. Please feel free to re-open if you feel your point hasn't been heard or you have more to say.
Hello,
I am seeing a performance issue in flat_hash_map and friends where we get page faults on the slots_ array on first inserts or finds into the map.

It's the common scenario of having an initial setup phase with a rough idea of how big the hashmap will be, reserve()ing the expected size, and then inserting and looking up elements on the hot path. In the current implementation the first inserts/finds cause page faults, as the slots_ array is not touched in rehash (assuming it's empty to start with).

With a node-based implementation like std::unordered_map you can't avoid the page faults, as you are allocating on each insert anyway. However, with a flat implementation no allocations take place on insert, so we can prefault the whole backing array and avoid page faults on the hot path.

Doing something like https://github.com/abseil/abseil-cpp/commit/d2359c4a3ffd81e2e9a58af012028055f16427b4 fixes this problem and moves the page faults into the setup phase. I have a small reproducer benchmark here https://gist.github.com/StephanDollberg/62097375c863687490eb027b5941f87c showing a 10-20% improvement in this scenario, which roughly matches what I am seeing in a similar prod scenario.
Has something like this ever been discussed?
Thanks, Stephan