Open vyuuui opened 5 months ago
It's also worth noting that the SysV ABI marks all SIMD registers as "not preserved across function calls", and since `vector<int>::operator[]` is a function (albeit a very simple one), it's yet another place where those registers can end up clobbered. At the very least, the liveness of ymm0/ymm1 does not persist across the loop, let alone the whole function. All this to say, this benchmark is relying on undefined behavior to produce the correct sum values.
This post mistakenly assumes that you can expect registers not to be clobbered between two inline assembler blocks. If you inspect the output from `clang++` versus `g++`, you'll notice that clang injects `vzeroupper` instructions between the inline assembler blocks in `sum_int_avx2`. Here are the two for comparison:

Output from g++ (GCC) 14.1.1 20240522:
g++ -msse2 -mavx -mavx2 avxbench.cpp -o benchmark; objdump -D benchmark -Mintel | grep -A 48 '<_Z12sum_int_avx2RKSt6vectorIiSaIiEE>:'
Output from clang:
g++ -msse2 -mavx -mavx2 avxbench.cpp -o benchmark; objdump -D benchmark -Mintel | grep -A 48 '<__Z12sum_int_avx2RKNSt3__16vectorIiNS_9allocatorIiEEEE>:'
Take note of the following:
Instead of doing inlined assembly blocks, it's probably best to put the entire benchmark in a separate assembly routine.
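
If a separate assembly file is more than you want, a middle ground is to keep the whole accumulation inside a single extended-asm statement and declare everything it touches as clobbered, so nothing has to stay live in ymm0 between blocks. The following is only a rough sketch of that idea, assuming GCC or clang on x86-64; `sum_blocks_avx2` and `sum_int_single_block` are made-up names, not the benchmark's actual code, and the scalar fallback is there only so it still runs on machines without AVX2.

```cpp
#include <cstddef>
#include <vector>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
// Hypothetical helper: sums `blocks` groups of 8 ints starting at `data`.
// The accumulator lives in ymm0 only inside this one asm statement, and
// ymm0 is listed as a clobber, so the compiler never has to guess about it.
__attribute__((target("avx2")))
static int sum_blocks_avx2(const int* data, std::size_t blocks) {
    alignas(32) int lanes[8] = {};
    if (blocks == 0) return 0;
    asm volatile(
        "vpxor   %%ymm0, %%ymm0, %%ymm0\n\t"   // zero the accumulator
        "1:\n\t"
        "vpaddd  (%[p]), %%ymm0, %%ymm0\n\t"   // add 8 ints from memory
        "add     $32, %[p]\n\t"                // advance by 8 * sizeof(int)
        "dec     %[n]\n\t"
        "jnz     1b\n\t"
        "vmovdqu %%ymm0, (%[out])\n\t"         // spill lanes before leaving
        "vzeroupper\n\t"
        : [p] "+r"(data), [n] "+r"(blocks)
        : [out] "r"(lanes)
        : "ymm0", "cc", "memory");
    int total = 0;
    for (int lane : lanes) total += lane;      // horizontal sum in plain C++
    return total;
}
#endif

// Hypothetical wrapper: uses the AVX2 path when the CPU supports it and
// handles the tail (or everything, on other targets) with scalar code.
int sum_int_single_block(const std::vector<int>& v) {
    std::size_t i = 0;
    int total = 0;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2")) {
        std::size_t blocks = v.size() / 8;
        total = sum_blocks_avx2(v.data(), blocks);
        i = blocks * 8;
    }
#endif
    for (; i < v.size(); ++i) total += v[i];
    return total;
}
```

Because the loop and the final store are in the same asm statement, neither `operator[]` calls nor compiler-inserted `vzeroupper` can land between the accumulate and the read-back.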