Kobzol / hardware-effects

Demonstration of various hardware effects.
MIT License

Add TLB aliasing example #7

Open btolsch opened 5 years ago

btolsch commented 5 years ago

This is in reference to #4. It makes direct use of perf events in Linux for measuring TLB misses, but also times each run. Overall, the results I got seem to make sense and are outlined in the README.md, but there are a few cases that don't (for example, ./tlb-aliasing 2048 1 doesn't give close to 2048 misses per iteration, which I would expect with a 1024-entry TLB).
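
For reference, the perf_event_open usage is roughly like the simplified sketch below (error handling and the store-miss counter are omitted, so this is not the exact code in the PR):

// Simplified sketch of counting dTLB load misses via perf_event_open on Linux.
// Not the literal PR code: error handling and the store-miss counter are left out.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

static int open_dtlb_load_miss_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    // Encoding from perf_event_open(2): cache id | (op << 8) | (result << 16).
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // pid = 0 (this process), cpu = -1 (any CPU), no group leader, no flags.
    return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

int main() {
    int fd = open_dtlb_load_miss_counter();
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... run the measured access loop here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("%llu dTLB load misses\n", static_cast<unsigned long long>(misses));
    close(fd);
    return 0;
}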

PTAL, thanks!

Kobzol commented 5 years ago

Thank you, this is awesome! :balloon:

Would you mind if I refactored it a little to bring it closer to the other examples? In particular, I'd create a benchmark script to test the various input combinations and refactor the code a bit more into C++ (not that it's of any extra use here, just to be consistent with the rest of the examples).

I would also probably remove the perf measurements from the code. It's more precise than just using perf on the whole executable, but it's not portable, and frankly I feel like it makes the code a bit magic for people who don't know about it. My point with the examples here is to pretty much ignore absolute measurements and just focus on relative differences between different input parameters to the program. In that spirit, I think that running perf externally is enough to show the differences. It will not be very precise for a very small number of misses, but the relative differences with bigger numbers will be visible in the end.

Also, a hint could be added to the README that the cpuid program on Linux can help with determining the TLB layout.

I suspect that with large strides some performance will be lost because the hardware prefetcher will not be able to prefetch across such large strides. Here the L1 cache miss count is twice as high with a larger stride and the same count, and the runtime is twice as slow. This may overshadow the cost of the TLB misses.

$ perf stat -edTLB-load-misses,dTLB-store-misses,L1-dcache-load-misses tlb-alias 2048 1
2308
1086.64 misses per repetition (217327925 total)

       271 524 493      dTLB-load-misses                                     (79,92%)
       138 841 752      dTLB-store-misses                                    (79,96%)
       568 777 142      L1-dcache-load-misses                                (57,20%)

$ perf stat -edTLB-load-misses,dTLB-store-misses,L1-dcache-load-misses tlb-alias 2048 16
4661
1206.49 misses per repetition (241297065 total)

       300 500 674      dTLB-load-misses                                     (80,00%)
       108 117 375      dTLB-store-misses                                    (79,98%)
       903 033 508      L1-dcache-load-misses                                (57,12%)

btolsch commented 5 years ago

Refactoring for consistency is fine with me. I wasn't aware of the cpuid program on Linux, but I guess still including tlb-info could be useful on Windows, unless CPU-Z or something similar already does that.

Kobzol commented 5 years ago

I agree, though this won't work directly on Windows, so it might as well link to something like this: https://stackoverflow.com/a/4823889/1107768.

Kobzol commented 5 years ago

I put the modified version here: https://github.com/Kobzol/hardware-effects/tree/tlb-aliasing. With % page_size, it behaves differently than before. Could you please test your previous hypotheses and results either with my branch or with % page_size?

btolsch commented 5 years ago

I tried your branch vs. mine, and with perf stat they behave almost identically (within normal run-to-run variation) in all the examples I gave in the README.md. What results were you getting that were different between the two branches?

Kobzol commented 5 years ago

With the % block_size I got some weird numbers with count/stride combinations that I couldn't explain (I don't remember the exact combinations). Let's work with the % page_size since that is probably what we want to test.

I can't reproduce your numbers for the L2 TLB. According to WikiChip and cpuid, my CPU (Kaby Lake) should have a 12-way associative shared L2 TLB (STLB) with 1536 entries, therefore there should be 128 sets and increments of 128 pages should alias.
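
To spell out the arithmetic I'm assuming here (simple modulo indexing by virtual page number, which is exactly the assumption being tested):

// Sketch of the assumed STLB indexing: 1536 entries / 12 ways = 128 sets, and the
// set is taken as (virtual page number % 128). Under that assumption, pages whose
// numbers differ by a multiple of 128 all map to the same set, so touching more
// than 12 of them per iteration should start evicting entries.
#include <cstdio>

int main() {
    const long entries = 1536, ways = 12;
    const long sets = entries / ways;  // 128
    for (long i = 0; i < 16; ++i) {
        long page = i * 128;           // a stride of 128 pages
        std::printf("page %5ld -> set %ld\n", page, page % sets);
    }
    return 0;
}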

But running 12 128, 16 128, and so on doesn't change the number of misses much (neither with your manual perf_event_open measurement nor with perf stat).

Either I've made a mistake somewhere or the TLB is using a different strategy for indexing into the cache (hashing?). Do you still get massive increases in TLB misses when going with count over 8 with the % page_size version?

I disabled (transparent) hugepages; that's the only thing I know of that could influence it.

btolsch commented 5 years ago

The results I get comparing the two branches mostly differ in cache miss counts, but otherwise, yes, I still get a jump in L2 TLB misses going from 8 128 to 12 128 to 16 128 on my Haswell CPU.

How do these compare on your CPU: 1536 1, 2304 1, and 3072 1?

Kobzol commented 5 years ago

This stuff is weird; I'm getting totally different results with your code and mine. I tracked it down to the method of allocation:

void* mem =
      mmap((void*)Tebibytes(2ul), block_size * count, PROT_READ | PROT_WRITE,
           MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

I would expect MAP_PRIVATE here, any reason why you used MAP_SHARED?

With MAP_PRIVATE and your code:

tlb-aliasing 1536 1 -> ~500 misses
tlb-aliasing 2304 1 -> ~2800 misses
tlb-aliasing 3072 1 -> ~500 misses

With MAP_SHARED and your code:

tlb-aliasing 1536 1 -> ~2 000 000 misses
tlb-aliasing 2304 1 -> ~150 000 000 misses
tlb-aliasing 3072 1 -> ~200 000 000 misses

btolsch commented 5 years ago

That's definitely even weirder. I didn't have a reason for choosing MAP_SHARED, but there should be no difference between these since there's no multithreading or IPC happening. I'm also not getting the differing results that you are seeing between the two flags.

Kobzol commented 5 years ago

I changed the program to receive MAP_PRIVATE/MAP_SHARED via an input argument, so that the binary is the same in both cases. This is the result I get (the last parameter is 1 for MAP_SHARED and 2 for MAP_PRIVATE):
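
The change is essentially this kind of flag selection (just a sketch for illustration, not the literal diff; the Tebibytes helper and fixed address mirror the snippet above):

// Sketch: pick the mapping type from the extra program argument,
// 1 -> MAP_SHARED, 2 -> MAP_PRIVATE; everything else stays as in the snippet above.
#include <sys/mman.h>
#include <cstddef>

constexpr unsigned long Tebibytes(unsigned long n) { return n << 40; }

void* allocate(std::size_t block_size, std::size_t count, int map_type_arg) {
    const int map_type = (map_type_arg == 1) ? MAP_SHARED : MAP_PRIVATE;
    return mmap((void*)Tebibytes(2ul), block_size * count, PROT_READ | PROT_WRITE,
                map_type | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}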

$ perf stat -edTLB-load-misses,dTLB-store-misses tlb-aliasing 1536 1 1
14.91 misses per repetition (1490712 total)
159706 us

         1 493 566      dTLB-load-misses                                            
         3 380 116      dTLB-store-misses

$  perf stat -edTLB-load-misses,dTLB-store-misses tlb-aliasing 1536 1 2
0.00 misses per repetition (16 total)
213354 us

               522      dTLB-load-misses                                            
                92      dTLB-store-misses                                           

Even if I add MAP_POPULATE, the results are the same. With MAP_PRIVATE I only get a high number of misses when I increase the offset to several hundred, but it's still an order of magnitude less than with MAP_SHARED.

btolsch commented 5 years ago

Is your branch behaving like mine with MAP_PRIVATE or MAP_SHARED? I still have absolutely no idea why the TLB misses would be different, but yours should behave like MAP_PRIVATE (since glibc's malloc should defer to mmap with MAP_PRIVATE for an allocation of this size, which you can verify with /proc/[pid]/maps).
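
For example, a quick way to check this from inside the program (just a sketch, not something that's in the PR) is to print the /proc/self/maps line covering the buffer; the permissions column ends in 'p' for a private mapping and 's' for a shared one:

// Sketch: dump the /proc/self/maps entry that covers a given address so you can
// see whether the region is backed by a private ("rw-p") or shared ("rw-s") mapping.
#include <cstdio>
#include <cstdint>
#include <fstream>
#include <string>

void print_mapping_of(const void* addr) {
    std::ifstream maps("/proc/self/maps");
    std::string line;
    const std::uintptr_t target = reinterpret_cast<std::uintptr_t>(addr);
    while (std::getline(maps, line)) {
        unsigned long start = 0, end = 0;
        // Each line starts with "start-end perms ...", with the addresses in hex.
        if (std::sscanf(line.c_str(), "%lx-%lx", &start, &end) == 2 &&
            target >= start && target < end) {
            std::puts(line.c_str());
            return;
        }
    }
}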

Kobzol commented 5 years ago

It's private, yet I got almost the same results as your version with MAP_SHARED.

I checked out a clean copy of your branch and tested it today with both MAP_PRIVATE and MAP_SHARED, and now it behaves reasonably. I don't remember whether it was doing this on my home or work PC, but let's not dwell on it; I probably made a mistake somewhere along the way.

Sorry for that.

I tested multiple configurations and got these results:

19.20 misses per repetition (1919587 total)
208376 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 1536 1':

       309 409 945      dTLB-loads                                                    (65,60%)
         2 157 857      dTLB-load-misses          #    0,70% of all dTLB cache hits   (65,59%)
         2 465 238      dTLB-misses               #    0,80% of all dTLB cache hits   (66,89%)
         3 660 753      dTLB-store-misses                                             (68,55%)
         4 393 181      dtlb_store_misses.miss_causes_a_walk                                     (67,76%)

       0,209603618 seconds time elapsed

1021.20 misses per repetition (102120216 total)
1331882 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 2304 1':

       463 318 338      dTLB-loads                                                    (66,38%)
       153 831 597      dTLB-load-misses          #   33,20% of all dTLB cache hits   (66,59%)
       153 920 401      dTLB-misses               #   33,22% of all dTLB cache hits   (66,90%)
        77 072 478      dTLB-store-misses                                             (66,99%)
        92 099 645      dtlb_store_misses.miss_causes_a_walk                                     (66,72%)

       1,333054803 seconds time elapsed

1371.14 misses per repetition (137113934 total)
1886300 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 3072 1':

       616 466 708      dTLB-loads                                                    (66,51%)
       207 158 225      dTLB-load-misses          #   33,60% of all dTLB cache hits   (66,52%)
       207 395 001      dTLB-misses               #   33,64% of all dTLB cache hits   (66,74%)
       101 289 174      dTLB-store-misses                                             (66,95%)
       122 304 010      dtlb_store_misses.miss_causes_a_walk                                     (66,75%)

       1,887647856 seconds time elapsed

0.00 misses per repetition (0 total)
862 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 12 128':

         2 607 518      dTLB-loads                                                  
               722      dTLB-load-misses          #    0,03% of all dTLB cache hits 
               722      dTLB-misses               #    0,03% of all dTLB cache hits 
               244      dTLB-store-misses                                           
     <not counted>      dtlb_store_misses.miss_causes_a_walk                                     (0,00%)

       0,001625369 seconds time elapsed

0.00 misses per repetition (0 total)
1116 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 16 128':

         3 415 369      dTLB-loads                                                  
               679      dTLB-load-misses          #    0,02% of all dTLB cache hits 
               679      dTLB-misses               #    0,02% of all dTLB cache hits 
               250      dTLB-store-misses                                           
     <not counted>      dtlb_store_misses.miss_causes_a_walk                                     (0,00%)

       0,001815701 seconds time elapsed

0.00 misses per repetition (0 total)
2137 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 24 128':

         5 027 968      dTLB-loads                                                  
               944      dTLB-load-misses          #    0,02% of all dTLB cache hits 
               944      dTLB-misses               #    0,02% of all dTLB cache hits 
               314      dTLB-store-misses                                           
     <not counted>      dtlb_store_misses.miss_causes_a_walk                                     (0,00%)

       0,002928527 seconds time elapsed

0.00 misses per repetition (0 total)
836 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 12 32':

         2 607 048      dTLB-loads                                                  
               712      dTLB-load-misses          #    0,03% of all dTLB cache hits 
               712      dTLB-misses               #    0,03% of all dTLB cache hits 
               251      dTLB-store-misses                                           
     <not counted>      dtlb_store_misses.miss_causes_a_walk                                     (0,00%)

       0,001495578 seconds time elapsed

0.00 misses per repetition (0 total)
1605 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 24 32':

         5 016 021      dTLB-loads                                                  
               811      dTLB-load-misses          #    0,02% of all dTLB cache hits 
               811      dTLB-misses               #    0,02% of all dTLB cache hits 
               292      dTLB-store-misses                                           
     <not counted>      dtlb_store_misses.miss_causes_a_walk                                     (0,00%)

       0,002236586 seconds time elapsed

0.00 misses per repetition (0 total)
810 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 12 64':

         2 611 794      dTLB-loads                                                  
               725      dTLB-load-misses          #    0,03% of all dTLB cache hits 
               725      dTLB-misses               #    0,03% of all dTLB cache hits 
               252      dTLB-store-misses                                           
     <not counted>      dtlb_store_misses.miss_causes_a_walk                                     (0,00%)

       0,001497991 seconds time elapsed

0.00 misses per repetition (0 total)
1657 us

 Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 24 64':

         5 017 697      dTLB-loads                                                  
               697      dTLB-load-misses          #    0,01% of all dTLB cache hits 
               697      dTLB-misses               #    0,01% of all dTLB cache hits 
               293      dTLB-store-misses                                           
     <not counted>      dtlb_store_misses.miss_causes_a_walk                                     (0,00%)

       0,002352180 seconds time elapsed

Offsets 32/64/128 don't increase the TLB misses when going over the associativity size. Maybe the Skylake TLB prefetcher got better and can avoid the misses when it recognizes a certain pattern?

btolsch commented 5 years ago

I'm only surprised by the 128 offset results; 16,128 and 24,128 should cause increased misses. The other thing I don't understand, though (and maybe it's not important), is what a TLB store miss is. My understanding is that a normal cache access can be a read or a write, but a TLB access would always be a read in order to do address translation. A TLB read miss would then cause it to fetch the page table entry and populate that TLB entry; that would be the only way to write to the TLB, and calling that a "miss" doesn't make sense. So I'm probably missing something there.

I wouldn't expect amazing prefetcher performance on the 128 offset examples since the 1536,1 is still incurring some misses.

The only other thing I can think of is that the perf stats for Skylake might be looking at the L1 TLB instead? But that should've had much more dramatically different results anyway, so that doesn't really make sense either.

When I have more time, I will try to look through the Intel manuals to see if I can find anything that would explain this. That could be a while though.

Kobzol commented 5 years ago

IMO a TLB store miss is a write access that didn't find its page translation in the TLB, and a TLB load miss is a read access that didn't find it.

I looked briefly into the manual, and something that may affect it is hyperthreading, because the TLB will be partitioned differently with HT active. However, if I understand it correctly, it probably only affects the ITLB entries.

Btw, I found out that the PRIVATE/SHARED fiasco only happened on my home notebook - the program doesn't crash and returns the same result, but with almost no misses when MAP_PRIVATE is used. I finally found the culprit: it was transparent huge pages, which were set to [always] on my notebook and were therefore used for such big allocations.
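
A per-allocation way to rule THP out, independent of the system-wide /sys/kernel/mm/transparent_hugepage/enabled setting, would be to madvise the benchmark buffer (just a sketch, not something the example currently does):

// Sketch: opt the benchmark buffer out of transparent hugepages so that every
// 4 KiB page really needs its own TLB entry, regardless of the system-wide setting.
#include <sys/mman.h>
#include <cstddef>

void* allocate_no_thp(std::size_t size) {
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem != MAP_FAILED) {
        // Ask the kernel not to back this range with (transparent) hugepages.
        madvise(mem, size, MADV_NOHUGEPAGE);
    }
    return mem;
}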