JustasMasiulis / inline_syscall

Inline syscalls made easy for windows on clang
Apache License 2.0

clobber list #8

Open ordinary-github-user opened 3 months ago

ordinary-github-user commented 3 months ago
  1. if I do this with -mavx2
        #define CLOBBER_LIST "memory", "cc", "rcx", "r11", "tmm0", "tmm1", "tmm2", "tmm3", "tmm4", "tmm5", "tmm6", "tmm7", \
                             "zmm0", "zmm1", "zmm2", "zmm3", "zmm4", "zmm5", "zmm6", "zmm7", "zmm8", "zmm9", "zmm10", "zmm11", "zmm12", "zmm13", "zmm14", "zmm15", "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21", "zmm22", "zmm23", "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29", "zmm30", "zmm31", \
                             "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5", "ymm6", "ymm7", "ymm8", "ymm9", "ymm10", "ymm11", "ymm12", "ymm13", "ymm14", "ymm15", "ymm16", "ymm17", "ymm18", "ymm19", "ymm20", "ymm21", "ymm22", "ymm23", "ymm24", "ymm25", "ymm26", "ymm27", "ymm28", "ymm29", "ymm30", "ymm31", \
                             "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7", "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15", "xmm16", "xmm17", "xmm18", "xmm19", "xmm20", "xmm21", "xmm22", "xmm23", "xmm24", "xmm25", "xmm26", "xmm27", "xmm28", "xmm29", "xmm30", "xmm31"

The exe size increases by 2 KB. All non-inlined functions that contain a syscall instruction push xmm6-15 onto the stack at the start of the function and pop them at the end. No random silent bugs. (Note: as originally posted, the macro was missing a comma after "zmm31" and had a trailing comma after "xmm31"; both are fixed above, since adjacent string literals concatenate into an invalid register name.)

  2. if I do this with -mavx2
        #define CLOBBER_LIST "memory", "cc", "rcx", "r11", "tmm0", "tmm1", "tmm2", "tmm3", "tmm4", "tmm5", "tmm6", "tmm7",  \
                             "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5",                                                \
                             "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5"

No change to exe size. No random silent bugs.

  3. if I do this with -mavx2
        #define CLOBBER_LIST "memory", "cc", "rcx", "r11", "tmm0", "tmm1", "tmm2", "tmm3", "tmm4", "tmm5", "tmm6", "tmm7"

No change to exe size, but I get a bunch of random silent bugs in the code next to the syscall instruction. I don't have a minimal reproducible example; with slight changes to the code the bug disappears, which makes it a pain to debug.

According to MSDN:

https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170#callercallee-saved-registers
https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170#register-volatility-and-preservation

> The x64 ABI considers the registers RAX, RCX, RDX, R8, R9, R10, R11, and XMM0-XMM5 volatile. When present, the upper portions of YMM0-YMM15 and ZMM0-ZMM15 are also volatile. On AVX512VL, the ZMM, YMM, and XMM registers 16-31 are also volatile. When AMX support is present, the TMM tile registers are volatile. Consider volatile registers destroyed on function calls unless otherwise safety-provable by analysis such as whole program optimization.

> The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile. They must be saved and restored by a function that uses them.

So my question is: how do I make a perfect, bug-free, future-proof clobber list from this? How do I clobber only the upper half of a register? (Also, I might try -mavx512f if I someday get a new CPU that supports AVX-512.)

JustasMasiulis commented 3 months ago

I've personally switched to modifying LLVM to not need the assembly at all (adding a calling convention, syscall instruction to the ISD tablegen, and modifying the call lowering for it).

        #define CLOBBER_LIST "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5"

Should ™️ suffice. AVX and AMX state is explicitly saved and restored by kernel code at use sites (https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/floating-point-support-for-64-bit-drivers), so in theory it shouldn't matter. rcx and r11 are outputs in the stubs that I have in the repo, so you shouldn't need to mark them as clobbers either (IIRC there was subtly better codegen if they were marked as outputs instead of clobbers). Not marking the volatile XMM regs as clobbers is an oversight on my part; I think I only really saw issues with it on GCC, but maybe newer versions of clang also face some problems.

ordinary-github-user commented 3 months ago

I understand that the x64 driver calling convention has stricter rules than the x64 usermode calling convention. I think I'm going to stick with the x64 usermode calling convention to be safe, since syscalls normally go through ntdll/win32u, and those all use the x64 usermode calling convention.

I did some testing to see which registers are volatile:

// emits an exported do-nothing function whose inline asm clobbers one register
#define ASDF(reg) extern "C" __declspec(dllexport) void test_ ## reg(){asm volatile("":::#reg);}

ASDF(rax)
ASDF(rbx)
ASDF(rcx)
ASDF(rdx)
ASDF(rsi)
ASDF(rdi)
ASDF(rbp)
ASDF(rsp)
ASDF(r8)
ASDF(r9)
ASDF(r10)
ASDF(r11)
ASDF(r12)
ASDF(r13)
ASDF(r14)
ASDF(r15)
ASDF(tmm0)
ASDF(tmm1)
ASDF(tmm2)
ASDF(tmm3)
ASDF(tmm4)
ASDF(tmm5)
ASDF(tmm6)
ASDF(tmm7)
ASDF(xmm0)
ASDF(xmm1)
ASDF(xmm2)
ASDF(xmm3)
ASDF(xmm4)
ASDF(xmm5)
ASDF(xmm6)
ASDF(xmm7)
ASDF(xmm8)
ASDF(xmm9)
ASDF(xmm10)
ASDF(xmm11)
ASDF(xmm12)
ASDF(xmm13)
ASDF(xmm14)
ASDF(xmm15)
ASDF(xmm16)
ASDF(xmm17)
ASDF(xmm18)
ASDF(xmm19)
ASDF(xmm20)
ASDF(xmm21)
ASDF(xmm22)
ASDF(xmm23)
ASDF(xmm24)
ASDF(xmm25)
ASDF(xmm26)
ASDF(xmm27)
ASDF(xmm28)
ASDF(xmm29)
ASDF(xmm30)
ASDF(xmm31)
ASDF(ymm0)
ASDF(ymm1)
ASDF(ymm2)
ASDF(ymm3)
ASDF(ymm4)
ASDF(ymm5)
ASDF(ymm6)
ASDF(ymm7)
ASDF(ymm8)
ASDF(ymm9)
ASDF(ymm10)
ASDF(ymm11)
ASDF(ymm12)
ASDF(ymm13)
ASDF(ymm14)
ASDF(ymm15)
ASDF(ymm16)
ASDF(ymm17)
ASDF(ymm18)
ASDF(ymm19)
ASDF(ymm20)
ASDF(ymm21)
ASDF(ymm22)
ASDF(ymm23)
ASDF(ymm24)
ASDF(ymm25)
ASDF(ymm26)
ASDF(ymm27)
ASDF(ymm28)
ASDF(ymm29)
ASDF(ymm30)
ASDF(ymm31)
ASDF(zmm0)
ASDF(zmm1)
ASDF(zmm2)
ASDF(zmm3)
ASDF(zmm4)
ASDF(zmm5)
ASDF(zmm6)
ASDF(zmm7)
ASDF(zmm8)
ASDF(zmm9)
ASDF(zmm10)
ASDF(zmm11)
ASDF(zmm12)
ASDF(zmm13)
ASDF(zmm14)
ASDF(zmm15)
ASDF(zmm16)
ASDF(zmm17)
ASDF(zmm18)
ASDF(zmm19)
ASDF(zmm20)
ASDF(zmm21)
ASDF(zmm22)
ASDF(zmm23)
ASDF(zmm24)
ASDF(zmm25)
ASDF(zmm26)
ASDF(zmm27)
ASDF(zmm28)
ASDF(zmm29)
ASDF(zmm30)
ASDF(zmm31)

I compiled this with -msse2 -mavx2 -march=znver4, looked at the compiled code in IDA Pro, and checked whether each function saves and restores the register inside the function body.

  1. function saves & restores the register = register is non-volatile = register does not go in the clobber list
  2. function does not save & restore the register = register is volatile, or does not exist on the current architecture = register goes in the clobber list

Is my reasoning correct here?

Also, I don't know which compiler flag is needed to generate AMX code that uses the tmm registers.

My conclusion is:

sse2

    volatile xmm0-5
non-volatile xmm6-15

avx2

    volatile xmm0-5
non-volatile xmm6-15

    volatile ymm0-5
non-volatile ymm6-15

avx512

    volatile xmm0-5 xmm16-31
non-volatile xmm6-15

    volatile ymm0-5 ymm16-31
non-volatile ymm6-15

    volatile zmm0-5 zmm16-31
non-volatile zmm6-15

So this is the clobber list (I'm putting the tmm registers in here too, since I don't know anything about AMX, and MSDN does say "When AMX support is present, the TMM tile registers are volatile"):

*no need to check __SSE2__ __AVX2__ __AVX512F__, because clang ignores registers in the clobber list that don't exist

#define CLOBBER_LIST "memory", "cc", "rcx", "r11", "tmm0", "tmm1", "tmm2", "tmm3", "tmm4", "tmm5", "tmm6", "tmm7",                                                              \
                        "zmm0", "zmm1", "zmm2", "zmm3", "zmm4", "zmm5", "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21", "zmm22", "zmm23", "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29", "zmm30", "zmm31", \
                        "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5", "ymm16", "ymm17", "ymm18", "ymm19", "ymm20", "ymm21", "ymm22", "ymm23", "ymm24", "ymm25", "ymm26", "ymm27", "ymm28", "ymm29", "ymm30", "ymm31", \
                        "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm16", "xmm17", "xmm18", "xmm19", "xmm20", "xmm21", "xmm22", "xmm23", "xmm24", "xmm25", "xmm26", "xmm27", "xmm28", "xmm29", "xmm30", "xmm31"

> but maybe newer versions of clang also face some problems

This bug has been here since the beginning. I've used your library with clang since you first posted it on UC many years ago. I'm guessing the reason you didn't notice it is probably the hashing you do next to the syscall instruction.

I had to apply NO_OPTIMIZE to local variables, or NO_INLINE, as a workaround:

#define NO_OPTIMIZE(x) asm volatile("" : "+m"(const_cast<std::remove_const_t<std::remove_reference_t<decltype(x)>>&>(x)));

Now that I've fixed the clobber list (probably), I finally don't have to use those workarounds anymore.

> IIRC there was subtly better codegen if they were marked as outputs instead of clobbers

Ah, so that's why. Is that still the case? I feel like having the volatile registers in the clobber list is much more readable than having them in the output list. I hope clang has fixed it.

> I've personally switched to modifying LLVM to not need the assembly at all (adding a calling convention, syscall instruction to the ISD tablegen, and modifying the call lowering for it).

Having a compiler intrinsic for syscalls would be nice, but I don't think I'm knowledgeable enough to mess with LLVM yet.