Kobzol / hardware-effects

Demonstration of various hardware effects.

cache-memory-bound: varying results #10

Closed fpnick closed 5 years ago

fpnick commented 5 years ago

Hi,

I compiled the cache-memory-bound example with the Intel C++ compiler. I wanted to check what VTune says about how memory bound the program is.

The issue is that (in the increment 1 case) I saw numbers ranging from 0% memory bound to 100% memory bound.

So I increased the size of the array by a factor of 4, but without luck.

Additionally, multiple executions of the program led to pretty different timings:

12:58:33 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
902
12:58:57 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
881
12:59:25 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
884
12:59:27 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
887
12:59:29 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
887
12:59:31 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
741
12:59:33 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
737

The processor is an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz (two of them), so I cannot rule out that the processor is changing its core frequency at some point. But the difference in the results looks somewhat suspicious, so I'd be interested in your opinion.

fpnick commented 5 years ago

Going back to the original size (32 * 1024 * 1024) but increasing the number of repetitions from 10 to 100, the results are much more stable:


13:20:11 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1829
13:20:18 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1801
13:20:20 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1800
13:20:23 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1797
13:20:26 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1806
13:20:28 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1795
13:20:30 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1795
13:20:33 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1819
13:20:36 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1812
13:20:39 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1810
13:20:41 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1820
13:20:44 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1812
13:20:46 fnick@leo3:cache-memory-bound (master)$ ./cache-memory-bound 1
1815

EDIT: With this version, I'm getting between 0% and 15% memory bound pipeline slots according to VTune.

Kobzol commented 5 years ago

The number of repetitions should be a bit higher, you're right. I lowered it so that the Python benchmarks don't take so much time. I recommend turning off CPU frequency scaling for benchmarking if you haven't already:

sudo cpupower frequency-set --governor performance
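(For reference, cpupower frequency-info shows which governor is currently active, so you can check whether the change took effect.)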

With VTune I get 13% memory bound with increment 1 and around 74% memory bound with increments 8-16, so that's pretty much what I'd expect. It's stuck on DRAM bandwidth and L3 latency. The bandwidth is utilized a lot because it's both reading and writing (*=).

With increment 1 the memory bound metric should be low, because you're effectively doing 16 multiplications per loaded cache line, which is not that bad. With increment 16 you are doing just 1 multiplication per loaded cache line, which should result in a high memory bound metric.
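For reference, here is a minimal sketch of the kind of strided kernel being discussed (illustrative only; the element type, repetition count, and argument handling are assumptions and may differ from the actual cache-memory-bound source). With increment 1, all 16 four-byte elements of each 64-byte cache line are multiplied; with increment 16, only one element per line is touched, yet every line is still loaded and written back:

// Illustrative sketch, not the repository's exact source.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    const std::size_t size = 32 * 1024 * 1024;   // elements; 4-byte ints -> 128 MiB
    const std::size_t increment = argc > 1 ? std::atoi(argv[1]) : 1;
    const int repetitions = 100;                 // more repetitions -> more stable timings

    std::vector<int> data(size, 1);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; r++) {
        // increment 1: 16 multiplications per 64-byte cache line
        // increment 16: 1 multiplication per line, but every line is still fetched
        for (std::size_t i = 0; i < size; i += increment) {
            data[i] *= 3;                        // read-modify-write, like the *= in the example
        }
    }
    const auto end = std::chrono::steady_clock::now();

    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::printf("%lld\n", static_cast<long long>(ms));
    return 0;
}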

Still, I probably named this example in a confusing way; it's not 1:1 with what VTune reports as its memory bound metric, and I wasn't trying to create an executable that would show up as 99.9% memory bound in VTune :)

All I wanted to show was that even though the program does several times less work, it takes pretty much the same time, because most of the time is spent loading data from memory. In this case, if you load the same number of cache lines, the individual computation doesn't make a difference, as long as it's just a single multiplication; it would of course show up if there were more computation per byte of memory.
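To put rough numbers on that (assuming 4-byte elements and 64-byte cache lines, consistent with the 16-multiplications-per-line figure above): a 32 * 1024 * 1024-element array is 128 MiB, i.e. 2 * 1024 * 1024 cache lines per pass. Increment 1 does 16 multiplications per line and increment 16 does one, but both walk all 2M lines, so the memory traffic per pass is identical and the runtime barely changes.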

fpnick commented 5 years ago

Sure, the example works perfectly fine! :) I wasn't expecting a 1:1 match with VTune anyway; I just wanted to see the general effect in VTune, which I do. Basically I was "cross-checking" that I'm looking at the right metric for my real application.

The percentages you have seen coincide with what I'm getting.

I have a question regarding the higher increments, but I'll open a new issue for that.

PS: The results were produced on our cluster, so I don't have the option there to change any settings regarding CPU frequency etc.

Kobzol commented 5 years ago

If it's a cluster I would expect that both hyper-threading and CPU scaling are disabled anyway :)