Kobzol / hardware-effects

Demonstration of various hardware effects.

cache-memory-bound: Effects after increment 16 #11


fpnick commented 5 years ago

Hi again,

here's my plot for the cache-memory-bound-test:

[Screenshot from 2018-12-03: plot of cache-memory-bound runtimes per increment]

Do you have any explanation for the measurements after increment 32? The processor is a Xeon(R) Gold 6130 CPU @ 2.10GHz.

Kobzol commented 5 years ago

Well, the explanation is simple: you're loading fewer cache lines and thus it takes less time :) With increments 1-16, you load N cache lines from memory. With increments 17-31 you load slightly fewer, because every so often you skip a whole cache line. With increment 32 you effectively skip every other cache line, so you load just N/2 lines. With increment 64 you skip 3 lines in a row and thus load only N/4 lines, etc. So naturally, as the increment gets larger and larger, you load less and less data from memory and it takes less time.

Let's say you have cache lines of just 2 elements (red lines mark cache line boundaries).

[Diagram: array elements grouped into 2-element cache lines, boundaries marked in red]

With increment 1, you hit all cache lines. With increment 2, you hit all cache lines. With increment 3, you skip the third cache line (addresses 4 and 5). With increment 4, you skip the second and fourth cache lines.

The interesting part is the range from 1 to 16, where you execute up to 16x fewer multiplications, but you load the same number of cache lines and thus the runtime stays roughly the same.

You can find similar graphs and a nice explanation here: http://igoro.com/archive/gallery-of-processor-cache-effects/.
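To make the shape of the loop concrete, here is a minimal sketch of the kind of strided pass being discussed (the 4-byte element type, buffer size and timing are assumptions for illustration, not the repository's exact benchmark code):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    // Strided pass over a large buffer: with 4-byte ints and 64-byte cache
    // lines, every increment from 1 to 16 still touches every cache line of
    // the buffer, so the memory traffic stays the same even though the number
    // of multiplications shrinks.
    int main(int argc, char** argv)
    {
        const std::size_t count = 64 * 1024 * 1024;                   // 256 MiB of ints
        const std::size_t increment = argc > 1 ? std::atoi(argv[1]) : 1;
        std::vector<int> data(count, 1);

        const auto begin = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < count; i += increment)
            data[i] *= 3;
        const auto end = std::chrono::steady_clock::now();

        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
        std::printf("increment %zu: %lld ms\n", increment, static_cast<long long>(ms));
        return 0;
    }

Doubling the increment anywhere in the 1-16 range halves the arithmetic but not the number of cache lines pulled in, which is why that part of the curve stays flat.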

fpnick commented 5 years ago

Makes sense. This explanation is too simple ;)

Thanks!

fpnick commented 5 years ago

... thinking about that: then where does the plateau in runtime between increments 128 and 512 come from? And why is the improvement at increment 32 not as large as one would expect when loading just half the number of cache lines?

Thanks for the article, I'll dig into that...

Kobzol commented 5 years ago

Well those questions are much harder to answer :) Here are some quick ideas.

The time is not spent just on loading cache lines, so the drop with increment 32 will not be a full 50 %.

Also, the hardware prefetcher preloads cache lines in advance, so until it figures out the stride/increment you're using, it will needlessly prefetch cache lines that never get used. It's actually possible that increment 32 loads almost the same number of cache lines as increment 1 because of the prefetcher, and that the time is lower simply because fewer multiplications are executed. I tried checking the prefetch count and it does go down with increment 32, but I'm not sure whether I'm checking the right perf counter.

I have no idea what might be causing the times for increments 128/256/512. Here is a plot from my work notebook:

[Plot: cache-memory-bound results measured on Kobzol's notebook]

As you can see, I don't observe the same effect. But if you see it consistently, something is probably causing it. The path from DRAM to a register is long and there are layers upon layers of complicated machinery in between. Maybe some TLB effect (again, a wild guess).

fpnick commented 5 years ago

Prefetching is the explanation my colleague was also leaning towards.

I thought I'd give this a look in VTune:

Increment 16: 13,500,405 loads / 19,000,570 stores / 14,000,840 LLC misses
Increment 32: 8,000,240 loads / 10,500,315 stores / 12,250,735 LLC misses
Increment 64: 2,500,075 loads / 10,500,315 stores / 4,500,270 LLC misses

Looking at the number of loads and LLC misses, some prefetching effect sounds reasonable imho. What's a little odd is the two identical store counts in the increment 32 and 64 cases (I double checked - I didn't type the same number twice by accident... ;))

But I guess you're right - the deeper you dig into this, the more complicated it gets (Anderson's Law).

Kobzol commented 5 years ago

I'm getting similar numbers with perf: LLC loads and misses go down, but LLC stores stay roughly the same - only the L1 stores go down. I guess it makes sense, because the code loads the number, multiplies it and writes it back immediately, so the store hits L1 while the cache line is still resident and the line only gets evicted later. I'm not 100 % sure how to interpret the LLC-loads and LLC-stores metrics, though.

Performance counter stats for 'cache-memory-bound 16':

        17 647 683      LLC-loads                                                     (79,63%)
         2 093 760      LLC-stores                                                    (60,83%)
        45 882 616      L1-dcache-stores                                              (80,49%)
         9 853 836      LLC-misses                #   55,84% of all LL-cache hits     (59,54%)
        44 656 986      mem-stores                                                    (79,19%)

       0,285644904 seconds time elapsed

 Performance counter stats for 'cache-memory-bound 32':

        10 572 302      LLC-loads                                                     (79,40%)
         2 286 330      LLC-stores                                                    (59,30%)
        32 826 397      L1-dcache-stores                                              (79,89%)
         7 561 590      LLC-misses                #   71,52% of all LL-cache hits     (61,30%)
        34 176 100      mem-stores                                                    (79,83%)

       0,194476817 seconds time elapsed

 Performance counter stats for 'cache-memory-bound 64':

         5 790 982      LLC-loads                                                     (78,05%)
         2 229 465      LLC-stores                                                    (62,16%)
        29 797 805      L1-dcache-stores                                              (81,22%)
         5 819 607      LLC-misses                #  100,49% of all LL-cache hits     (59,78%)
        28 081 828      mem-stores                                                    (78,60%)

       0,127973339 seconds time elapsed

It's interesting that for increment 64 pretty much all of the LLC loads miss (the LLC-misses here are mostly load misses, I checked).

travisdowns commented 5 years ago

With larger strides, everything is less efficient on a "per line" basis because you have to do relatively more work such as page walks, and prefetching is also less efficient (since it stops at page boundaries). E.g., there are 64 64-byte cache lines in a 4 KiB page; if you load only half of them, the page walk that brought in the page's translation is amortized over half the number of lines, and this effect becomes more pronounced with strides like 512.
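A quick back-of-the-envelope sketch of that amortization (assuming 4 KiB pages, 64-byte lines and 4-byte elements; the real benchmark configuration may differ):

    #include <cstddef>
    #include <cstdio>

    // How many cache-line loads a single page walk is amortized over, for a
    // given element increment (assumes 4 KiB pages, 64 B lines, 4 B ints).
    constexpr std::size_t page_bytes = 4096;
    constexpr std::size_t line_bytes = 64;

    constexpr std::size_t lines_touched_per_page(std::size_t increment_elems)
    {
        const std::size_t stride = increment_elems * sizeof(int);
        return stride <= line_bytes ? page_bytes / line_bytes   // every line touched
                                    : page_bytes / stride;      // only every stride-th line
    }

    int main()
    {
        std::printf("increment  16: %zu lines per page walk\n", lines_touched_per_page(16));   // 64
        std::printf("increment 128: %zu lines per page walk\n", lines_touched_per_page(128));  // 8
        std::printf("increment 512: %zu lines per page walk\n", lines_touched_per_page(512));  // 2
        return 0;
    }

So at increment 512 each translation covers only a couple of line loads instead of 64, which makes the per-line overhead grow even as the total number of lines keeps shrinking.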

Also, even the memory controllers and the RAM itself are optimized for linear access: e.g., opening a row (a DRAM "page") may bring 2048 bits or so to the sense amps, ready to be read more quickly (64 bytes at a time), so strides greater than 1 waste some of that effort since some of those lines will never be read.

BBI-YggyKing commented 4 years ago

I'm trying to replicate these sorts of effects. I wrote my own test, which boils down to pretty much the same inner loop. However, I'm really not seeing the flattening of performance between 1 and 32 byte increments, which surprises me. What I see instead is that processing time is approximately proportional to the number of data entries that get updated - as if cache effects did not exist. Can anyone shed light on my results?

Here is my runnable code on repl.it. Results are graphed in google sheets.

image

I get similar results running on my local desktop computer (Intel Xeon E5-1607 v3@3.10GHz).

Here's my inner loop. The loop termination condition lets me use negative step sizes; I wondered if this would affect prefetch behaviour (it didn't).

  // Multiply every step-th element by 3. The index and bounds are kept signed
  // so that a negative step can terminate the loop via the i >= 0 check.
  // (data_t is defined elsewhere in the test.)
  void process_once(data_t* buffer, size_t n, size_t start, size_t step)
  {
      for (long i = (long)start; (i >= 0) && (i < (long)n); i += (long)step)
      {
          buffer[i] = buffer[i] * 3;
      }
  }

Kobzol commented 4 years ago

I tried your code on my notebook and I got a flat curve in the middle. I would try running the inner loop for a single step value a lot of times; if you run it just once, I would be skeptical because of noise, data left in the cache from the previous run, etc.

But other than that, are you running it with optimizations (-O2 or -O3)? On repl.it it compiles in debug mode, so the loop overhead dominates, everything is much slower, and the drop with step size is more noticeable. I tried manually compiling it on the repl with -O3 (F1 -> Open shell) and got the following output:

8,14.77
16,11.15
24,10.10
32,10.62
40,9.49
48,8.89
56,8.92
64,8.54
72,9.17
80,8.57
88,8.39
96,8.41
104,8.16
112,7.87
120,7.96
128,7.71
144,6.91
160,6.63
176,6.13
192,5.87
208,5.48
224,5.28
240,4.79

Which seems reasonable.

BBI-YggyKing commented 4 years ago

Turning on optimization definitely makes a difference. Running locally on my desktop with optimization enabled (/O2 /Ox with the Visual Studio command-line tools), it runs substantially faster and the relative numbers are somewhat different.

I don't think it's necessary to run each step size multiple times - I did this originally and the results were very consistent from one repeat to the next. I attribute this to the fact that I'm iterating a buffer that is large enough to effectively flush the cache each time I walk through it. To ensure this is the case I added a "flush" step which iterates the full buffer one byte at a time between each of the tests, which should put the cache in a consistent state prior to running each iteration.
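A sketch of what that flush pass might look like (illustrative only - the names, byte-wise walk and buffer size are assumptions, not the actual repl.it code):

    #include <cstddef>
    #include <vector>

    // Walk the whole buffer byte by byte between measurements so that every
    // test starts from a comparable cache state.
    unsigned char flush_between_tests(const unsigned char* bytes, std::size_t size)
    {
        unsigned char sink = 0;
        for (std::size_t i = 0; i < size; ++i)
            sink += bytes[i];      // touch every byte so the buffer sweeps through the caches
        return sink;               // return the sum so the loop is not optimized away
    }

    int main()
    {
        std::vector<unsigned char> buffer(256u * 1024 * 1024);   // larger than the LLC
        return flush_between_tests(buffer.data(), buffer.size());
    }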

I also tweaked the test to output the effective throughput of each iteration - how many MB/sec are being touched. My results indicate that the throughput drops pretty much linearly with step size at least up to the 64-byte cache line size, at which point it levels out - which seems reasonable.

[Graph: effective throughput (MB/sec) vs. step size]