Created attachment 25475
Source files to reproduce the bug
When linking a program against a DLL with the /delayload switch, the first call
to a function defined in the DLL receives a bad value for (at least one of) its
floating point parameters.
Attached are two source files, my_lib.cpp and my_exe.cpp, to reproduce the bug.
They should be built as follows:
- "C:\Program Files\LLVM\bin\clang-cl.exe" my_lib.cpp /link /DLL /OUT:my_dll.dll
- "C:\Program Files\LLVM\bin\clang-cl.exe" /c my_exe.cpp /OUT:my_exe.obj
- "C:\Program Files\LLVM\bin\lld-link.exe" my_dll.lib Delayimp.lib /delayload:my_dll.dll my_exe.obj /OUT:my_exe.exe
When running my_exe.exe, the output will be "1 0 3" instead of the expected "1
2 3".
The last step can be replaced with
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64\link.exe" my_dll.lib Delayimp.lib /delayload:my_dll.dll my_exe.obj /OUT:my_exe.exe
to use link.exe with the same options, or with
"C:\Program Files\LLVM\bin\lld-link.exe" my_dll.lib my_exe.obj /OUT:my_exe.exe
to use lld without /delayload. In both of those cases the resulting executable
gives the expected "1 2 3".
I believe the bug occurs because __delayLoadHelper2 (the function defined in
delayimp.lib that actually loads the DLL and locates the function we want to
call on first use) writes into the top of its caller's stack space (I don't
know why; is it a quirk of the Windows calling convention?), but the thunk
generated by lld doesn't reserve that space.
Specifically, the thunk generated by lld (for x64) looks like this:
push rcx
push rdx
push r8
push r9
sub rsp,48h
movdqa xmmword ptr [rsp],xmm0
movdqa xmmword ptr [rsp+10h],xmm1
movdqa xmmword ptr [rsp+20h],xmm2
movdqa xmmword ptr [rsp+30h],xmm3
mov rdx,rax
lea rcx,[__xt_z+28h (01401C9E88h)]
call __delayLoadHelper2 (01401A3464h)
movdqa xmm0,xmmword ptr [rsp]
movdqa xmm1,xmmword ptr [rsp+10h]
movdqa xmm2,xmmword ptr [rsp+20h]
movdqa xmm3,xmmword ptr [rsp+30h]
add rsp,48h
pop r9
pop r8
pop rdx
pop rcx
jmp rax
(it allocates space on the stack and uses it to save the registers prior to
calling __delayLoadHelper2, then restores them afterwards)
Whereas the thunk generated by link.exe looks like this:
mov qword ptr [rsp+8],rcx
mov qword ptr [rsp+10h],rdx
mov qword ptr [rsp+18h],r8
mov qword ptr [rsp+20h],r9
sub rsp,68h
movdqa xmmword ptr [rsp+20h],xmm0
movdqa xmmword ptr [rsp+30h],xmm1
movdqa xmmword ptr [rsp+40h],xmm2
movdqa xmmword ptr [rsp+50h],xmm3
mov rdx,rax
lea rcx,[__DELAY_IMPORT_DESCRIPTOR_my_dll (0140435020h)]
call __delayLoadHelper2 (01400089C2h)
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmm1,xmmword ptr [rsp+30h]
movdqa xmm2,xmmword ptr [rsp+40h]
movdqa xmm3,xmmword ptr [rsp+50h]
mov rcx,qword ptr [rsp+70h]
mov rdx,qword ptr [rsp+78h]
mov r8,qword ptr [rsp+80h]
mov r9,qword ptr [rsp+88h]
add rsp,68h
jmp __tailMerge_my_dll+77h (01402237B8h)
jmp rax
It looks very similar but, for some reason, it doesn't save the xmmX registers
at the top of the stack like lld does; it leaves 32 bytes that
__delayLoadHelper2 is free to overwrite.
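The offsets in the two listings can be cross-checked with a little arithmetic. The sketch below is only an illustration: the constant and function names are made up for this note, and it assumes the documented Windows x64 rule that a caller leaves 32 bytes of home (shadow) space at [rsp, rsp+20h) for the callee, and that each thunk is entered with rsp equal to 8 modulo 16, as is usual right after a call.

```cpp
#include <cstdint>

// Home (shadow) space the Windows x64 convention lets a callee overwrite:
// [rsp, rsp+20h) as seen by the caller just before the call instruction.
constexpr std::int64_t kHomeSpace = 0x20;

// Lowest save-slot offset in each thunk, relative to rsp at the call
// (values read from the two listings above).
constexpr std::int64_t kLldLowestSave = 0x00;   // movdqa [rsp],xmm0
constexpr std::int64_t kLinkLowestSave = 0x20;  // movdqa [rsp+20h],xmm0

// Bytes each thunk moves rsp down before the call.
constexpr std::int64_t kLldFrame = 4 * 8 + 0x48;  // four pushes + sub rsp,48h
constexpr std::int64_t kLinkFrame = 0x68;         // sub rsp,68h

// True when nothing is saved inside the callee-owned home space.
constexpr bool home_space_free(std::int64_t lowest_save) {
    return lowest_save >= kHomeSpace;
}

// An address written as [rsp+off] before `sub rsp,frame` is
// [rsp+off+frame] afterwards.
constexpr std::int64_t after_sub(std::int64_t off, std::int64_t frame) {
    return off + frame;
}

// Assuming rsp % 16 == 8 on entry, rsp is 16-byte aligned at the call
// (as movdqa requires) exactly when the frame size is 8 modulo 16.
constexpr bool aligned_at_call(std::int64_t frame) {
    return frame % 16 == 8;
}

static_assert(!home_space_free(kLldLowestSave), "lld saves xmm0 in the home space");
static_assert(home_space_free(kLinkLowestSave), "link.exe keeps the home space free");
static_assert(after_sub(0x08, kLinkFrame) == 0x70, "rcx is reloaded from [rsp+70h]");
static_assert(aligned_at_call(kLldFrame) && aligned_at_call(kLinkFrame), "both aligned");
```

Both thunks move rsp by the same 68h bytes, so the difference is only in where the saves are placed: link.exe stores rcx, rdx, r8 and r9 in its own caller's home space and keeps the lowest 20h bytes of its frame unused, while lld's thunk places the xmm saves there.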
Indeed (at least on my machine), the first two instructions of
__delayLoadHelper2 are:
mov qword ptr [rsp+10h],rbx
mov qword ptr [rsp+18h],rsi
which, if I'm not mistaken, write into the stack space where xmm0 and xmm1
were saved.
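To see which bytes those two stores hit, they can be translated into the thunk's frame. This is a rough sketch with hypothetical names (Range, helper_store, overlaps), assuming the floating point arguments are scalar values held in the low 8 bytes of each xmm register. The call pushes an 8-byte return address, so [rsp+off] inside the helper is [rsp+off-8] in the thunk.

```cpp
#include <cstdint>

// Byte range [begin, end), as an offset from the thunk's rsp at the moment
// it executes `call __delayLoadHelper2`.
struct Range {
    std::int64_t begin;
    std::int64_t end;
};

// The call pushes an 8-byte return address, so a helper store to
// [rsp+off] lands at the thunk's [rsp+off-8].
constexpr Range helper_store(std::int64_t off, std::int64_t size) {
    return Range{off - 8, off - 8 + size};
}

constexpr bool overlaps(Range a, Range b) {
    return a.begin < b.end && b.begin < a.end;
}

// Save slots in the lld thunk: xmm0 at [rsp], xmm1 at [rsp+10h].
constexpr Range xmm0_slot{0x00, 0x10};
constexpr Range xmm1_slot{0x10, 0x20};
// Assumption: a scalar float/double argument lives in the low 8 bytes.
constexpr Range xmm0_low{0x00, 0x08};
constexpr Range xmm1_low{0x10, 0x18};

constexpr Range store_rbx = helper_store(0x10, 8);  // mov [rsp+10h],rbx
constexpr Range store_rsi = helper_store(0x18, 8);  // mov [rsp+18h],rsi

// rbx lands in the upper half of the xmm0 slot: the scalar value survives.
static_assert(overlaps(store_rbx, xmm0_slot) && !overlaps(store_rbx, xmm0_low), "");
// rsi lands on the low half of the xmm1 slot: the second argument is lost.
static_assert(overlaps(store_rsi, xmm1_low), "");
```

Under those assumptions the first store only damages the unused upper half of the saved xmm0, while the second overwrites the low half of the saved xmm1. That is consistent with the observed output "1 0 3", where only the second value is wrong.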
repro-bug-lld.zip
(500 bytes, application/x-zip-compressed)