joerowell opened this issue 3 years ago
With a completely random access pattern in buckets, of course it's a huge cache miss problem.
That's why all 3 versions try to access only a single cacheline (moving size
to a different structure would imply at least two cacheline accesses).
I'm also looking into perf, but I'm not familiar with it yet and not sure where to look exactly.
Though I am afraid your results might not be very telling. The actual running time of the loop is such a small fraction of the running time of the executable that the total perf numbers might mainly reflect effects from all the other instructions executed.
The reason the bad/alt show fewer misses is sort of a quirk of the way these "instructions retired [miss state]" events work in the presence of branch mispredictions.
These events count, once for each retired load instruction (or instruction with an embedded load), what the result of the load was: L1 hit, L2 hit, L3 hit, L3 miss, etc. However, an instruction may actually execute several times due to mispredictions: the CPU predicts a branch to go some way and keeps executing the instructions after that, but if the guess is wrong, the instructions are cancelled and retried with the right prediction (this can happen more than once, since there might be another branch that gets mispredicted in the shadow of the first one). In this case, the perf event doesn't count anything about the first time the instruction executed, only the final one, where the instruction retires (i.e., when it completes successfully without being cancelled by a mispredict or any other type of speculation failure).
Whew! So what this means is that if you have a lot of misses and also mispredicts, many of the misses will happen in the shadow of a branch misprediction, but then the instruction gets cancelled (but the miss keeps being handled in the memory subsystem, you can't cancel that), and by the time the instruction runs again it is transformed into some kind of hit.
This is noticeable if you contrast with the counters that work at the cache level, counting all accesses or misses and not tied to an instruction. These are often much higher than the "retired" counters.
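For what it's worth, you can see this contrast directly by asking perf for both kinds of events side by side. The `mem_load_retired.*` names below are Intel (Skylake-era) event names and vary by microarchitecture, and `./bench` is a placeholder for the actual binary:

```shell
# Retired-instruction view: counts one outcome per load that retires,
# so misses taken under a misprediction and then cancelled don't show up.
perf stat -e mem_load_retired.l1_hit,mem_load_retired.l3_miss ./bench

# Cache-level view: counts every access/miss the cache actually handles,
# including those issued speculatively and later cancelled.
perf stat -e LLC-loads,LLC-load-misses,branch-misses ./bench
```

When mispredictions are frequent, the cache-level miss counts can be much higher than the retired-instruction ones, for exactly the reason described above.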
I think I was wrong: it looks like a cache miss problem.
I have no idea why so many pop up: I've tried a few different configurations (e.g. splitting the loop so the wrap-around doesn't confuse the prefetcher), and they all seem to give results like the following. This effect seems to be present even at table sizes that fit into my rather small (4 MB) cache.
For `test_ok` (ignore the iTLB-load-misses; that counter doesn't work):
For `test_bad`:
I think the most notable of these is that the `test_bad` result has far fewer LLC-load-misses (both as a percentage and in total). This makes me think that there's a weird pre-fetching advantage here for the `test_bad` case; I have no idea what causes that. I also tried moving the `size` variable from `bucket_t` into its own dedicated aligned vector, but no joy there either.