Closed travisdowns closed 5 years ago
Thank you :) I didn't even realize that it's an MMX instruction. Since I primarily want to show the WC blowup in this case, I think the flexibility of configurable write regions is more useful than hardcoding them into the source.
In principle you can use template expansion (or macro magic) to avoid any actual hardcoding: all the separate functions will be generated under the covers, though of course you still need a limit at some point. I agree it's not really important for illustrating what you want to show.
I considered macroing it, but that probably wouldn't increase the readability :) I'm thinking of using macros to generate large functions to demonstrate instruction cache misses. Thanks for your help and for explaining to me what actually might be happening with the write combining.
Uses 64-bit `movnti` rather than `_mm_stream_pi`, which is an older MMX instruction that is less well supported by the newest CPUs. It also compiles to better code, since the compiler doesn't need to shuffle the value into an MMX register.

The main impact is that small increments aren't penalized as much anymore. Before, I got:
Note that increment 1 in particular is very slow, about 4x or 5x slower than ideal, because the scheduler gets clogged up with instructions and memory-level parallelism (MLP) is reduced. After, it looks like:
The first few increments are still slower than the later ones, but by a smaller factor, and overall the result better reflects the hardware. To get rid of all the performance degradation you would have to change how the loops are written, e.g., use a dedicated function for each increment value.
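As a rough illustration of the store this change switches to, here is a minimal sketch of a strided non-temporal write loop. It assumes the SSE2 intrinsic `_mm_stream_si64`, which on x86-64 maps to the 64-bit form of `movnti`; whether the patch uses that exact intrinsic is an assumption, and the helper name `stream_fill` and its parameters are hypothetical. On other architectures the sketch falls back to a plain store so that it still compiles.

```cpp
#include <cstddef>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_stream_si64 (also provides _mm_sfence)
#define HAVE_MOVNTI64 1
#else
#define HAVE_MOVNTI64 0
#endif

// Hypothetical helper: writes `value` to every `increment`-th 64-bit slot of
// `dst`, using a non-temporal store where one is available.
void stream_fill(long long* dst, std::size_t count, std::size_t increment, long long value) {
    if (increment == 0) {
        return;  // avoid an infinite loop on a zero stride
    }
    for (std::size_t i = 0; i < count; i += increment) {
#if HAVE_MOVNTI64
        // The value goes straight from a general-purpose register to memory,
        // so nothing has to be moved into an MMX register first.
        _mm_stream_si64(dst + i, value);
#else
        dst[i] = value;  // fallback: ordinary store on non-x86-64 targets
#endif
    }
#if HAVE_MOVNTI64
    _mm_sfence();  // order the weakly-ordered streaming stores before later stores
#endif
}
```

The increment is still a runtime parameter here, so this sketch does not address the remaining slowdown at small increments described above.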