use SSE movnti for streaming stores

travisdowns commented 5 years ago

Uses 64-bit movnti rather than _mm_stream_pi, which is an older MMX instruction that is less well supported by the newest CPUs. This also compiles to better code, since the compiler doesn't need to shuffle the value into an MMX register.
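For illustration, here is a minimal sketch of a strided streaming store using the `_mm_stream_si64` intrinsic, which compilers emit as a 64-bit movnti. It assumes an x86-64 target with SSE2 and falls back to a plain store elsewhere. `stream_fill` and its parameters are hypothetical names for this example, not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics: _mm_stream_si64; also provides _mm_sfence
#define HAS_MOVNTI 1
#else
#define HAS_MOVNTI 0    // assumption: other targets just use an ordinary store
#endif

// Hypothetical helper: writes `value` into every `step`-th 64-bit slot of `buf`.
// On x86-64 each write is a non-temporal store straight from a general-purpose
// register, so no value has to be moved into an MMX register first.
inline void stream_fill(std::vector<long long>& buf, std::size_t step, long long value) {
    assert(step > 0);
    long long* data = buf.data();
    for (std::size_t i = 0; i < buf.size(); i += step) {
#if HAS_MOVNTI
        _mm_stream_si64(data + i, value);
#else
        data[i] = value;
#endif
    }
#if HAS_MOVNTI
    _mm_sfence();  // order the non-temporal stores before any later stores
#endif
}
```

Calling `stream_fill(buf, 4, 42)` on a zero-initialized vector writes 42 to indices 0, 4, 8, and so on, leaving the other slots untouched.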

The main impact is that small increments are no longer penalized as heavily. Before, I got:

$ for i in {1..20}; do printf "%2d " $i ; ./write-combining 20 $i; done
 1 3693
 2 1920
 3 1381
 4 1043
 5 950
 6 1017
 7 892
 8 906
 9 2008
10 2271

Note that increment 1 in particular is very slow, about 4x or 5x slower than ideal, because the scheduler gets clogged up with instructions and memory-level parallelism (MLP) is reduced. After, it looks like:

The first few increments are still slower than the later ones, but by a smaller factor, and overall the result better reflects the hardware. To get rid of all the performance degradation you would have to change how the loops are written, e.g., use a dedicated function for each increment value.

Kobzol commented 5 years ago

Thank you :) I didn't even realize that it's an MMX instruction. Since I primarily want to display the WC blowup in this case, I think that the flexibility of having the write regions is more useful than hardcoding it into the source.

travisdowns commented 5 years ago

In principle you can use template expansion (or macro magic) to avoid any actual hardcoding: all the separate functions will be generated under the covers, although of course you still need a limit at some point. I agree it's not really important for illustrating what you want to show.
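A rough sketch of the template-expansion idea, assuming C++14 or later. The compiler instantiates one loop per increment, and a table of function pointers selects among them at run time. All names here (`write_with_step`, `make_table`, `write_dispatch`, `kMaxStep`) are hypothetical, and a plain store stands in for the streaming store to keep the example portable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One instantiation per increment: Step is a compile-time constant, so each
// generated function has its own loop specialized for that stride.
template <std::size_t Step>
void write_with_step(std::vector<long long>& buf, long long value) {
    static_assert(Step > 0, "increment must be positive");
    for (std::size_t i = 0; i < buf.size(); i += Step) {
        buf[i] = value;  // stand-in for the real streaming store
    }
}

using Kernel = void (*)(std::vector<long long>&, long long);

// Expands the index pack 0..N-1 into pointers to write_with_step<1..N>.
template <std::size_t... Is>
constexpr std::array<Kernel, sizeof...(Is)> make_table(std::index_sequence<Is...>) {
    return {{&write_with_step<Is + 1>...}};
}

// The limit mentioned above: only increments 1..kMaxStep get a function.
constexpr std::size_t kMaxStep = 20;

// Picks the pre-generated function for a run-time increment.
inline void write_dispatch(std::vector<long long>& buf, std::size_t step, long long value) {
    static constexpr auto table = make_table(std::make_index_sequence<kMaxStep>{});
    assert(step >= 1 && step <= kMaxStep && "no function generated for this increment");
    table[step - 1](buf, value);
}
```

Nothing is hardcoded per increment in the source, but the binary still contains `kMaxStep` separate functions, which is the trade-off discussed in this thread.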

Kobzol commented 5 years ago

I considered macroing it, but that probably wouldn't increase the readability :) I'm thinking of using macros to generate large functions to demonstrate instruction cache misses. Thanks for your help and for explaining to me what actually might be happening with the write combining.

Kobzol / hardware-effects

use SSE movnti for streaming stores #8