ordinary-github-user opened this issue 3 months ago
I've personally switched to modifying LLVM to not need the assembly at all (adding a calling convention, syscall instruction to the ISD tablegen, and modifying the call lowering for it).
```cpp
#define CLOBBER_LIST "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5"
```
Should ™️ suffice. AVX and AMX state is explicitly saved and restored by kernel code at use sites (https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/floating-point-support-for-64-bit-drivers), so in theory it shouldn't matter. `rcx` and `r11` are outputs in the stubs that I have in the repo, so you shouldn't need to mark them as clobbers either (IIRC there was subtly better codegen if they were marked as outputs instead of clobbers). Not marking the volatile XMM registers as clobbers is an oversight on my part; I think I only really saw issues with it on GCC, but maybe newer versions of clang also face some problems.
i understand that the x64 driver calling convention has stricter rules than the x64 user-mode calling convention. i think im going to stick with the x64 user-mode calling convention to be safe, since syscalls normally go through ntdll/win32u, and those all use the x64 user-mode calling convention
i did some testing to see which registers are volatile
```cpp
#define ASDF(reg) extern "C" __declspec(dllexport) void test_ ## reg(){asm volatile("":::#reg);}
ASDF(rax)
ASDF(rbx)
ASDF(rcx)
ASDF(rdx)
ASDF(rsi)
ASDF(rdi)
ASDF(rbp)
ASDF(rsp)
ASDF(r8)
ASDF(r9)
ASDF(r10)
ASDF(r11)
ASDF(r12)
ASDF(r13)
ASDF(r14)
ASDF(r15)
ASDF(tmm0)
ASDF(tmm1)
ASDF(tmm2)
ASDF(tmm3)
ASDF(tmm4)
ASDF(tmm5)
ASDF(tmm6)
ASDF(tmm7)
ASDF(xmm0)
ASDF(xmm1)
ASDF(xmm2)
ASDF(xmm3)
ASDF(xmm4)
ASDF(xmm5)
ASDF(xmm6)
ASDF(xmm7)
ASDF(xmm8)
ASDF(xmm9)
ASDF(xmm10)
ASDF(xmm11)
ASDF(xmm12)
ASDF(xmm13)
ASDF(xmm14)
ASDF(xmm15)
ASDF(xmm16)
ASDF(xmm17)
ASDF(xmm18)
ASDF(xmm19)
ASDF(xmm20)
ASDF(xmm21)
ASDF(xmm22)
ASDF(xmm23)
ASDF(xmm24)
ASDF(xmm25)
ASDF(xmm26)
ASDF(xmm27)
ASDF(xmm28)
ASDF(xmm29)
ASDF(xmm30)
ASDF(xmm31)
ASDF(ymm0)
ASDF(ymm1)
ASDF(ymm2)
ASDF(ymm3)
ASDF(ymm4)
ASDF(ymm5)
ASDF(ymm6)
ASDF(ymm7)
ASDF(ymm8)
ASDF(ymm9)
ASDF(ymm10)
ASDF(ymm11)
ASDF(ymm12)
ASDF(ymm13)
ASDF(ymm14)
ASDF(ymm15)
ASDF(ymm16)
ASDF(ymm17)
ASDF(ymm18)
ASDF(ymm19)
ASDF(ymm20)
ASDF(ymm21)
ASDF(ymm22)
ASDF(ymm23)
ASDF(ymm24)
ASDF(ymm25)
ASDF(ymm26)
ASDF(ymm27)
ASDF(ymm28)
ASDF(ymm29)
ASDF(ymm30)
ASDF(ymm31)
ASDF(zmm0)
ASDF(zmm1)
ASDF(zmm2)
ASDF(zmm3)
ASDF(zmm4)
ASDF(zmm5)
ASDF(zmm6)
ASDF(zmm7)
ASDF(zmm8)
ASDF(zmm9)
ASDF(zmm10)
ASDF(zmm11)
ASDF(zmm12)
ASDF(zmm13)
ASDF(zmm14)
ASDF(zmm15)
ASDF(zmm16)
ASDF(zmm17)
ASDF(zmm18)
ASDF(zmm19)
ASDF(zmm20)
ASDF(zmm21)
ASDF(zmm22)
ASDF(zmm23)
ASDF(zmm24)
ASDF(zmm25)
ASDF(zmm26)
ASDF(zmm27)
ASDF(zmm28)
ASDF(zmm29)
ASDF(zmm30)
ASDF(zmm31)
```
i compiled this with `-msse2`, `-mavx2`, and `-march=znver4`, looked at the compiled code in ida pro, and checked whether each function saves and restores the register inside its body. is my reasoning correct here? also i dont know which compiler flag is needed to generate amx code that uses tmm registers
my conclusion is:

- sse2: volatile xmm0-5; non-volatile xmm6-15
- avx2: volatile xmm0-5, ymm0-5; non-volatile xmm6-15, ymm6-15
- avx512: volatile xmm0-5, xmm16-31, ymm0-5, ymm16-31, zmm0-5, zmm16-31; non-volatile xmm6-15, ymm6-15, zmm6-15
so, this is the clobber list (putting the tmm registers here too, since i dont know anything about amx, and msdn does say "When AMX support is present, the TMM tile registers are volatile")

*no need to check `__SSE2__`/`__AVX2__`/`__AVX512F__`, because clang ignores non-existent registers in the clobber list
```cpp
#define CLOBBER_LIST "memory", "cc", "rcx", "r11", "tmm0", "tmm1", "tmm2", "tmm3", "tmm4", "tmm5", "tmm6", "tmm7", \
    "zmm0", "zmm1", "zmm2", "zmm3", "zmm4", "zmm5", "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21", "zmm22", "zmm23", "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29", "zmm30", "zmm31", \
    "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5", "ymm16", "ymm17", "ymm18", "ymm19", "ymm20", "ymm21", "ymm22", "ymm23", "ymm24", "ymm25", "ymm26", "ymm27", "ymm28", "ymm29", "ymm30", "ymm31", \
    "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm16", "xmm17", "xmm18", "xmm19", "xmm20", "xmm21", "xmm22", "xmm23", "xmm24", "xmm25", "xmm26", "xmm27", "xmm28", "xmm29", "xmm30", "xmm31"
```
> but maybe newer versions of clang also face some problems
this bug has been here since the beginning. i've used your library with clang since you first posted it on uc many years ago. im guessing the reason you didn't notice is probably the hashing you do next to the syscall instruction

i had to apply `NO_OPTIMIZE` to local variables, or use `NO_INLINE`, as a workaround

```cpp
#define NO_OPTIMIZE(x) asm volatile("" : "+m"(const_cast<std::remove_const_t<std::remove_reference_t<decltype(x)>>&>(x)));
```

now that i've fixed the clobber list (probably), i finally dont have to use those workarounds anymore
> IIRC there was subtly better codegen if they were marked as outputs instead of clobbers

ah, so thats why. is that still the case? i feel like having the volatile registers in the clobber list is much more readable than having them in the output list. i hope clang has fixed it
> I've personally switched to modifying LLVM to not need the assembly at all (adding a calling convention, syscall instruction to the ISD tablegen, and modifying the call lowering for it).

having a compiler intrinsic for syscall would be nice, but i dont think im knowledgeable enough to mess with llvm yet
- `-mavx2`: exe size increases by 2kb. all non-inlined functions that contain syscall instructions push xmm6-15 onto the stack at function entry and pop them at exit. no random silent bugs
- `-mavx2`: no change to exe size. no random silent bugs
- `-mavx2`: no change to exe size, but i get a bunch of random silent bugs in code next to the syscall instruction. i dont have a minimum reproducible example; with slight changes to the code the bug is gone. pain to debug
according to msdn:
https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170#callercallee-saved-registers
https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170#register-volatility-and-preservation
so my question is: how do i make the *perfect*, *bug free*, *future proof* clobber list with this? how do i clobber only the upper half of a register? (also i might try `-mavx512f` if i some day get a new cpu that supports avx512)