cr-marcstevens / hashtable_mystery

Perf seems to say cache misses. #1

Open joerowell opened 3 years ago

joerowell commented 3 years ago

I think I was wrong: it looks like a cache miss problem.

I have no idea why so many pop up: I've tried a few different configurations (e.g. splitting the loop so the wrap-around doesn't confuse the pre-fetcher, roughly as in the sketch below), and they all give results like the following. The effect is present even at table sizes that fit into my rather small cache (4MB).
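
Roughly, the split looks like this sketch (`probe_split`, `Bucket`, and `match` are placeholder names, not the repo's actual code): the wrap-around scan becomes two linear passes, so the hardware prefetcher only ever sees a forward stride.

    #include <cstddef>
    #include <vector>

    // Sketch only: `Bucket` and `match` stand in for the real types.
    // The wrap-around probe is split into two linear passes so the
    // hardware prefetcher only ever sees a forward stride.
    template <typename Bucket, typename Pred>
    std::size_t probe_split(const std::vector<Bucket>& table,
                            std::size_t start, Pred match) {
        // Pass 1: from the hash position to the end of the table.
        for (std::size_t i = start; i < table.size(); ++i)
            if (match(table[i])) return i;
        // Pass 2: wrap to the front and scan up to the start position.
        for (std::size_t i = 0; i < start; ++i)
            if (match(table[i])) return i;
        return table.size();  // not found
    }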

For test_ok (ignore the iTLB-load-misses: the counter doesn't work):

tablesize: 117440512 elements: 500000 loadfactor=0.00425747
- test insert_ok : 70ms

 Performance counter stats for './test_binary':

          1,272.03 msec task-clock                #    1.000 CPUs utilized          
                 1      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           788,484      page-faults               #    0.620 M/sec                  
     4,471,546,297      cycles                    #    3.515 GHz                      (30.49%)
     3,345,514,675      instructions              #    0.75  insn per cycle           (38.35%)
       543,906,346      branches                  #  427.588 M/sec                    (38.67%)
           721,118      branch-misses             #    0.13% of all branches          (38.98%)
       810,121,388      L1-dcache-loads           #  636.871 M/sec                    (39.30%)
        86,674,182      L1-dcache-load-misses     #   10.70% of all L1-dcache accesses  (39.31%)
         2,470,728      LLC-loads                 #    1.942 M/sec                    (31.15%)
         1,122,441      LLC-load-misses           #   45.43% of all LL-cache accesses  (30.83%)
   <not supported>      L1-icache-loads                                             
         2,232,660      L1-icache-load-misses                                         (30.52%)
       748,086,815      dTLB-loads                #  588.103 M/sec                    (30.20%)
         1,476,844      dTLB-load-misses          #    0.20% of all dTLB cache accesses  (30.19%)
             6,175      iTLB-loads                #    0.005 M/sec                    (30.19%)
         1,610,066      iTLB-load-misses          # 26073.94% of all iTLB cache accesses  (30.19%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

       1.272476311 seconds time elapsed

       0.440133000 seconds user
       0.832252000 seconds sys

For test_bad:

tablesize: 117440512 elements: 500000 loadfactor=0.00425747
- test insert_bad: 15ms

 Performance counter stats for './test_binary':

          1,200.27 msec task-clock                #    1.000 CPUs utilized          
                 6      context-switches          #    0.005 K/sec                  
                 1      cpu-migrations            #    0.001 K/sec                  
           788,484      page-faults               #    0.657 M/sec                  
     4,286,761,398      cycles                    #    3.572 GHz                      (30.96%)
     3,606,370,073      instructions              #    0.84  insn per cycle           (38.69%)
       574,402,741      branches                  #  478.562 M/sec                    (38.69%)
           734,593      branch-misses             #    0.13% of all branches          (38.69%)
       790,892,690      L1-dcache-loads           #  658.931 M/sec                    (38.69%)
        86,610,638      L1-dcache-load-misses     #   10.95% of all L1-dcache accesses  (38.38%)
         2,165,358      LLC-loads                 #    1.804 M/sec                    (30.66%)
           888,020      LLC-load-misses           #   41.01% of all LL-cache accesses  (30.66%)
   <not supported>      L1-icache-loads                                             
         1,685,232      L1-icache-load-misses                                         (30.66%)
       765,002,012      dTLB-loads                #  637.360 M/sec                    (30.66%)
         2,110,749      dTLB-load-misses          #    0.28% of all dTLB cache accesses  (30.65%)
             9,347      iTLB-loads                #    0.008 M/sec                    (30.65%)
         1,546,789      iTLB-load-misses          # 16548.51% of all iTLB cache accesses  (30.65%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

       1.200824975 seconds time elapsed

       0.364062000 seconds user
       0.836142000 seconds sys

I think the most notable difference is that the test_bad run has far fewer LLC-load-misses, both as a percentage (41.01% vs. 45.43%) and in absolute count (888,020 vs. 1,122,441).

This makes me think that there's a weird pre-fetching advantage here for the test_bad case. I have no idea what causes that.

I also tried placing the size variables from bucket_t into their own dedicated aligned vector, but no joy there either.

cr-marcstevens commented 3 years ago

With a completely random access pattern over the buckets, of course it's a huge cache-miss problem. That's why all 3 versions try to access only a single cacheline (moving size to a different structure would imply at least two cacheline accesses per probe). I'm also looking into perf, but I'm not familiar with it yet, so I'm not sure exactly where to look.
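
To make the single-cacheline point concrete, a minimal sketch (assuming 64-byte cachelines; the field sizes are placeholders, not the actual bucket_t):

    #include <cstdint>

    // Minimal sketch of the single-cacheline layout (placeholder field
    // sizes, not the actual bucket_t): the occupancy count lives in the
    // same 64-byte-aligned struct as the slots, so a probe touches
    // exactly one cacheline. Moving `size` to a separate array would
    // make every probe touch two lines: one for the sizes, one for
    // the bucket itself.
    struct alignas(64) bucket_t {
        std::uint32_t elems[15];  // element slots
        std::uint32_t size;       // number of occupied slots
    };
    static_assert(sizeof(bucket_t) == 64,
                  "bucket should fill exactly one cacheline");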

Though I am afraid your results might not be very telling. The actual running time of the loop is such a small fraction of the running time of the executable that the total perf numbers might mainly reflect all the other instructions executed.
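
One way to isolate the loop (a sketch, not code from this repo) is to count events around just the region of interest with perf_event_open(2), so process startup and page-fault handling stay out of the numbers:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    // Sketch (Linux-only, minimal error handling): count LLC misses
    // around just the measured region instead of the whole executable.
    static int open_counter(std::uint32_t type, std::uint64_t config) {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = type;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;        // start stopped; enable around the loop
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    void run_insert_loop() { /* placeholder: the actual benchmark loop */ }

    int main() {
        int fd = open_counter(PERF_TYPE_HARDWARE,
                              PERF_COUNT_HW_CACHE_MISSES);
        if (fd < 0) { std::perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        run_insert_loop();                     // measured region only
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        std::uint64_t misses = 0;
        read(fd, &misses, sizeof(misses));
        std::printf("LLC misses in loop: %llu\n",
                    (unsigned long long)misses);
        close(fd);
        return 0;
    }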

travisdowns commented 3 years ago

The reason the bad/alt versions show fewer misses is sort of a quirk of the way these "instructions retired [miss state]" events work in the presence of branch mispredictions.

These events count, once for each load instruction (or instruction with an embedded load), what the result of the load was: L1 hit, L2 hit, L3 hit, L3 miss, etc. However, an instruction may actually execute several times due to mispredictions: the CPU predicts a branch to go some way and keeps executing the instructions after it, but if the guess is wrong, those instructions are cancelled and retried with the right prediction (this can happen more than once, since another branch in the shadow of the first one might also be mispredicted). In that case, the perf event doesn't count anything about the earlier executions of the instruction, only the final one, where the instruction retires (i.e., completes successfully without being cancelled by a mispredict or any other type of speculation failure).

Whew! So what this means is that if you have a lot of misses and also a lot of mispredicts, many of the misses will happen in the shadow of a branch misprediction. The instruction then gets cancelled (though the miss keeps being handled in the memory subsystem; you can't cancel that), and by the time the instruction runs again, the access has been transformed into some kind of hit.
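
As a hypothetical illustration of the pattern (placeholder names, not the benchmark's actual code):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical illustration: a data-dependent branch guards a load
    // from a random index. On a mispredict, the load can execute
    // speculatively and miss in the cache; the line is fetched anyway,
    // so when the load re-executes and retires, it hits, and the
    // "retired miss" event records a hit instead of a miss.
    std::uint64_t sum_guarded(const std::vector<std::uint64_t>& table,
                              const std::vector<std::uint32_t>& keys,
                              const std::vector<std::uint32_t>& idx) {
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < keys.size(); ++i)
            if (keys[i] & 1)            // ~50% taken, hard to predict
                sum += table[idx[i]];   // load in the branch's shadow
        return sum;
    }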

This is noticeable if you contrast these with the counters that work at the cache level, which count all accesses or misses and are not tied to any instruction. Those are often much higher than the "retired" counters.
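
One way to see the contrast directly is to ask perf stat for both kinds of event. The names vary by microarchitecture (these are Skylake-era Intel names; check perf list on your machine):

    perf stat -e mem_load_retired.l3_miss,longest_lat_cache.miss ./test_binary

The cache-level event counts speculative accesses too, so it typically comes out much higher than the retired-load event, which is exactly the gap described above.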