adang1345 / delvewheel

Self-contained Python wheels for Windows
MIT License

vsruntime DLLs conflict after delvewheel repair #49

Closed Grantim closed 1 month ago

Grantim commented 1 month ago

I am building a wheel package for PyPI. That usually works, but after updating some third-party dependencies I got an error:

>>> from meshlib import mrmeshpy as mm
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: DLL load failed while importing mrmeshpy: The specified module could not be found.

(I am using Python 3.11.9.)


Let's call the previous, working wheel the good whl and the new, failing one the bad whl.

Both were repaired with delvewheel repair, and both contain the same DLLs, including msvcp140-hash.dll, vcruntime140-hash.dll, vcruntime140_1-hash.dll (where hash stands for the actual hash, as described in the Name Mangling section of the delvewheel documentation).

So the good whl works fine and the bad one does not. After hours of investigation, I found that either of these options makes the bad one work: --no-dll "msvcp140.dll;vcruntime140_1.dll;vcruntime140.dll" or --no-mangle "msvcp140.dll;vcruntime140_1.dll;vcruntime140.dll".

It seems to me that there are conflicts with the original DLLs in the system that produce this error: 1) --no-dll excludes msvcp140-hash.dll, vcruntime140-hash.dll, vcruntime140_1-hash.dll from my package, so it just uses the system DLLs; 2) --no-mangle includes msvcp140.dll, vcruntime140.dll, vcruntime140_1.dll in my package without the hash, so only one copy of each DLL ends up loaded, which also prevents the conflict.
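For reference, the two workarounds above could be scripted along the following lines. This is only a sketch: the helper name repair_command and the wheel filename in the usage note are hypothetical, and the option spelling is taken from this comment rather than checked against a particular delvewheel release.

```python
# DLL names are the ones listed in this comment.
VC_RUNTIME_DLLS = ["msvcp140.dll", "vcruntime140_1.dll", "vcruntime140.dll"]


def repair_command(wheel, mode="--no-mangle"):
    """Build a `delvewheel repair` command line for one of the two workarounds.

    mode is "--no-dll" (leave the DLLs out of the wheel) or
    "--no-mangle" (bundle them under their original names).
    """
    if mode not in ("--no-dll", "--no-mangle"):
        raise ValueError(f"unsupported mode: {mode}")
    # delvewheel takes the DLL names as one semicolon-separated argument.
    return ["delvewheel", "repair", mode, ";".join(VC_RUNTIME_DLLS), wheel]
```

The resulting list can be passed to subprocess.run(repair_command("meshlib.whl"), check=True); passing a list avoids shell quoting problems with the semicolons.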

Interestingly, the good whl does not have this problem (it has the same dependencies, but some of them are older versions).


Has anyone faced something like this, and what is the best way to fix it? (It would be nice if someone could explain why this issue happened; "DLL conflicts" is just a theory based on the symptoms.)

Thanks!

Link to SO question

adang1345 commented 1 month ago

Can you upload the bad wheel here? I can take a look at it and see what the problem is.

Grantim commented 1 month ago

Sure! Thanks!

bad one good one

adang1345 commented 1 month ago

For the bad wheel, I traced the failure to the DLL tbb12-58efd192759dacc003e42fc3899b5ee8.dll, and I ran some experiments to try to figure out what was wrong with it.

I noticed that if I use ctypes.windll.kernel32.LoadLibraryExW() to load tbb12-58efd192759dacc003e42fc3899b5ee8.dll directly, it fails. If I load msvcp140-f1c90b395d7d901af7e16ef487237278.dll first, then loading tbb12-58efd192759dacc003e42fc3899b5ee8.dll succeeds. My first suspicion was that there was a bug in the name-mangling mechanism which was corrupting tbb12.dll and causing it to be unloadable directly. I analyzed tbb12-58efd192759dacc003e42fc3899b5ee8.dll but could not find anything wrong with it.
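A minimal sketch of this kind of ctypes experiment is shown below. It is not the exact script used here: the helper name try_load and the default flag combination are assumptions (the flags mirror the ones Python's import machinery is generally understood to pass), and the code only does real work on Windows.

```python
import ctypes
import sys

# Flag values from the Windows SDK (libloaderapi.h).
LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR = 0x00000100
LOAD_LIBRARY_SEARCH_DEFAULT_DIRS = 0x00001000
DEFAULT_FLAGS = LOAD_LIBRARY_SEARCH_DEFAULT_DIRS | LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR


def try_load(path, flags=DEFAULT_FLAGS):
    """Call LoadLibraryExW(path, NULL, flags) and return (handle, last_error).

    handle is None when loading failed; last_error is then the Win32 error code.
    """
    if sys.platform != "win32":
        raise OSError("LoadLibraryExW is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.LoadLibraryExW.argtypes = [ctypes.c_wchar_p, ctypes.c_void_p, ctypes.c_uint32]
    kernel32.LoadLibraryExW.restype = ctypes.c_void_p
    handle = kernel32.LoadLibraryExW(path, None, flags)
    return handle, ctypes.get_last_error()
```

Calling try_load on the full path of tbb12-58efd192759dacc003e42fc3899b5ee8.dll in a fresh interpreter, and then again after first calling it on msvcp140-f1c90b395d7d901af7e16ef487237278.dll, is one way to compare the two load orders described above.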

Then I wrote a program in C which uses LoadLibraryExW() to load tbb12-58efd192759dacc003e42fc3899b5ee8.dll, and I found that it sometimes failed and sometimes succeeded depending on the location of the C program and the flags that were passed to LoadLibraryExW(). As a control, I tried loading msvcp140-f1c90b395d7d901af7e16ef487237278.dll in the same manner, which succeeded in every scenario. This told me that tbb12-58efd192759dacc003e42fc3899b5ee8.dll is directly loadable in certain scenarios.

I was unable to find an obvious rule for why loading tbb12-58efd192759dacc003e42fc3899b5ee8.dll would succeed in some cases and fail in others. The only explanation I have is that at least one of the Microsoft Visual C++ DLLs such as vcruntime140.dll relies on process-global state that can be broken if two versions of that DLL are loaded in the same process. That is, if vcruntime140-hash1.dll and vcruntime140-hash2.dll are loaded into the same process, then the DLL loader no longer functions correctly. The fact that no one else has reported a similar issue and that past versions of your wheel worked fine would suggest that it is rare for the DLL loader to be broken in this way, and you happened to stumble upon the very specific set of circumstances that triggers the issue. I don't know for certain that this is what's causing your issue, but I think it's the most logical explanation.

I think, then, the solution would be for delvewheel to avoid name-mangling the Microsoft Visual C++ DLLs. I will work on making this change.

Grantim commented 1 month ago

Thanks for the help! We will try to investigate further.

Fedr commented 1 month ago

Thanks! I managed to reproduce the issue outside Python with the following C++ program:

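#include <windows.h>

int main()
{
    // Add the Python install directory and the wheel's bundled-DLL directory to the search path.
    AddDllDirectory( LR"(C:\Program Files\Python311)" );
    AddDllDirectory( LR"(C:\Users\user4\AppData\Roaming\Python\Python311\site-packages\meshlib.libs)" );
    // Load the extension module, searching the default directories plus the ones added above.
    HMODULE h = LoadLibraryExA( R"(C:\Users\user4\AppData\Roaming\Python\Python311\site-packages\meshlib\mrmeshpy.pyd)", NULL, LOAD_LIBRARY_SEARCH_DEFAULT_DIRS );
    // Only release the module if loading actually succeeded.
    if ( h )
        FreeLibrary( h );
}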

It looks 100% reproducible with the same error:

344c:d070 @ 1248997843 - LdrpProcessWork - ERROR: Unable to load DLL: "msvcp140-f1c90b395d7d901af7e16ef487237278.dll", Parent Module: "C:\Users\user4\AppData\Roaming\Python\Python311\site-packages\meshlib.libs\tbb12-58efd192759dacc003e42fc3899b5ee8.dll", Status: 0xc0000135

An additional finding is that if one copies the files

msvcp140-f1c90b395d7d901af7e16ef487237278.dll
vcruntime140-26a92e4fb4b73ddc824fe6616b0ea281.dll
vcruntime140_1-23cea287d52749969d25554c25715a49.dll 

from C:\Users\user4\AppData\Roaming\Python\Python311\site-packages\meshlib.libs\ to C:\Program Files\Python311\, then importing the "bad" wheel (from meshlib import mrmeshpy as mm) and running the C++ program above both succeed.

So it looks like the problem is not in name mangling but in something else. What do you think?

adang1345 commented 1 month ago

I noticed a similar thing during my investigation, where I tried copying DLLs into different locations and seeing the result. But I could not determine what was special about tbb12-58efd192759dacc003e42fc3899b5ee8.dll. There are other DLLs that depend on msvcp140-f1c90b395d7d901af7e16ef487237278.dll, but I was able to load them with no problem.

And when I compared the good wheel with the bad wheel, I could not find a clear explanation for why one worked and the other did not.

Fedr commented 1 month ago

Thanks!

"The only explanation I have is that at least one of the Microsoft Visual C++ DLLs such as vcruntime140.dll relies on process-global state that can be broken if two versions of that DLL are loaded in the same process."

To check this, I compiled the small C++ application above with the VS runtime linked statically, so that no second vcruntime140.dll is loaded into the process before the error appears (and it still appears): DebugLog-51420.txt. So I think we can rule out this explanation.

Grantim commented 1 month ago

Just found one more thing: using this tool, I printed the PE headers for the tbb libraries: good_pe.txt bad_pe.txt

The bad one before delvewheel: bad_ref.txt

This line seems strange: [image]

adang1345 commented 1 month ago

That's an issue with the PE Explorer tool. A lot of PE file analysis tools expect the DLL names to be in the same PE file section and fail to display the DLL information when that's not the case. DLL names are usually in the same PE section. However, the name-mangling mechanism can cause the DLL names to be placed into different PE sections. Many DLLs that have been name-mangled with delvewheel have this characteristic, and I have never seen this characteristic cause problems before. For instance, spdlog-9a52397535aaa1e58bd985a9de6110d0.dll loads fine, yet PE Explorer shows missing DLL names as well:

DLL NAME :
Characteristics : 0x42089508
OriginalFirstThunk : 0x42089508
TimeDateStamp : 0x4204E3D0
ForwarderChain : 0x4204E3D0
FirstThunk : 0x42074698

Imported Functions :

        __C_specific_handler
        __current_exception_context
        __current_exception
        memset
        memmove
        memcpy
        memcmp
        memchr
        _CxxThrowException
        __std_exception_destroy
        __std_exception_copy
        _purecall
        __std_type_info_destroy_list
        __std_terminate

DLL NAME :
Characteristics : 0x42089580
OriginalFirstThunk : 0x42089580
TimeDateStamp : 0x4204E3D0
ForwarderChain : 0x4204E3D0
FirstThunk : 0x42074710

Imported Functions :

        __CxxFrameHandler4

I have discovered something interesting, though, when debugging the following code.

#include <stdio.h>
#include <Windows.h>

int main() {
    if (!AddDllDirectory(L"C:\\Users\\adang\\Downloads\\venv\\Lib\\site-packages\\meshlib.libs")) {
        printf("AddDllDirectory() failed\n");
    }
    SetLastError(0);
    HMODULE h;
    h = LoadLibraryExW(L"C:\\Users\\adang\\Downloads\\venv\\Lib\\site-packages\\meshlib.libs\\tbb12-58efd192759dacc003e42fc3899b5ee8.dll", NULL, LOAD_LIBRARY_SEARCH_DEFAULT_DIRS | LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR);
    DWORD err = GetLastError();
    printf("%p; %lu\n", h, err);
}

When I try to load tbb12-58efd192759dacc003e42fc3899b5ee8.dll, the DLL search path is wrong and is using the PATH environment variable instead. It would seem that the LOAD_LIBRARY_SEARCH_DEFAULT_DIRS | LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR flags are being ignored for some reason.

0ff4:19e0 @ 04991312 - LdrLoadDll - ENTER: DLL name: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\tbb12-58efd192759dacc003e42fc3899b5ee8.dll
0ff4:19e0 @ 04991312 - LdrpLoadDllInternal - ENTER: DLL name: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\tbb12-58efd192759dacc003e42fc3899b5ee8.dll
0ff4:19e0 @ 04991312 - LdrpResolveDllName - ENTER: DLL name: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\tbb12-58efd192759dacc003e42fc3899b5ee8.dll
0ff4:19e0 @ 04991312 - LdrpResolveDllName - RETURN: Status: 0x00000000
0ff4:19e0 @ 04991312 - LdrpMinimalMapModule - ENTER: DLL name: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\tbb12-58efd192759dacc003e42fc3899b5ee8.dll
ModLoad: 00007ff8`55080000 00007ff8`550d7000   C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\tbb12-58efd192759dacc003e42fc3899b5ee8.dll
0ff4:19e0 @ 04991312 - LdrpMinimalMapModule - RETURN: Status: 0x00000000
0ff4:19e0 @ 04991312 - LdrpFindKnownDll - ENTER: DLL name: msvcp140-f1c90b395d7d901af7e16ef487237278.dll
0ff4:19e0 @ 04991312 - LdrpFindKnownDll - RETURN: Status: 0xc0000135
0ff4:4e10 @ 04991312 - LdrpSearchPath - ENTER: DLL name: msvcp140-f1c90b395d7d901af7e16ef487237278.dll
0ff4:19e0 @ 04991312 - LdrpFindKnownDll - ENTER: DLL name: vcruntime140-26a92e4fb4b73ddc824fe6616b0ea281.dll
0ff4:19e0 @ 04991312 - LdrpFindKnownDll - RETURN: Status: 0xc0000135
0ff4:19e0 @ 04991312 - LdrpFindKnownDll - ENTER: DLL name: vcruntime140_1-23cea287d52749969d25554c25715a49.dll
0ff4:5284 @ 04991312 - LdrpSearchPath - ENTER: DLL name: vcruntime140-26a92e4fb4b73ddc824fe6616b0ea281.dll
0ff4:19e0 @ 04991328 - LdrpFindKnownDll - RETURN: Status: 0xc0000135
0ff4:25c4 @ 04991328 - LdrpSearchPath - ENTER: DLL name: vcruntime140_1-23cea287d52749969d25554c25715a49.dll
0ff4:19e0 @ 04991328 - LdrpPreprocessDllName - INFO: DLL api-ms-win-crt-heap-l1-1-0.dll was redirected to C:\WINDOWS\SYSTEM32\ucrtbase.dll by API set
0ff4:19e0 @ 04991328 - LdrpPreprocessDllName - INFO: DLL api-ms-win-crt-runtime-l1-1-0.dll was redirected to C:\WINDOWS\SYSTEM32\ucrtbase.dll by API set
0ff4:19e0 @ 04991328 - LdrpPreprocessDllName - INFO: DLL api-ms-win-crt-string-l1-1-0.dll was redirected to C:\WINDOWS\SYSTEM32\ucrtbase.dll by API set
0ff4:19e0 @ 04991328 - LdrpPreprocessDllName - INFO: DLL api-ms-win-crt-environment-l1-1-0.dll was redirected to C:\WINDOWS\SYSTEM32\ucrtbase.dll by API set
0ff4:19e0 @ 04991328 - LdrpPreprocessDllName - INFO: DLL api-ms-win-crt-stdio-l1-1-0.dll was redirected to C:\WINDOWS\SYSTEM32\ucrtbase.dll by API set
0ff4:4e10 @ 04991312 - LdrpComputeLazyDllPath - INFO: DLL search path computed: C:\Users\adang\source\repos\cppsandbox\x64\Release;C:\WINDOWS\SYSTEM32;C:\WINDOWS\system;C:\WINDOWS;.;C:\Program Files (x86)\Windows Kits\10\Debuggers\x64;C:\Program Files\Python312\Scripts\;C:\Program Files\Python312\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\Microsoft VS Code\bin;C:\MinGW\bin\;C:\MinGW\msys\1.0;C:\MinGW\ms

Compare this with the log for when I load fmt-d8b924ba1577612827210a349cbf8c6e.dll, where the DLL search path is correct.

393c:5554 @ 06186234 - LdrLoadDll - ENTER: DLL name: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\fmt-d8b924ba1577612827210a349cbf8c6e.dll
393c:5554 @ 06186234 - LdrpLoadDllInternal - ENTER: DLL name: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\fmt-d8b924ba1577612827210a349cbf8c6e.dll
393c:5554 @ 06186234 - LdrpResolveDllName - ENTER: DLL name: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\fmt-d8b924ba1577612827210a349cbf8c6e.dll
393c:5554 @ 06186234 - LdrpResolveDllName - RETURN: Status: 0x00000000
393c:5554 @ 06186234 - LdrpMinimalMapModule - ENTER: DLL name: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\fmt-d8b924ba1577612827210a349cbf8c6e.dll
ModLoad: 00007ff8`c5a60000 00007ff8`c5a87000   C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs\fmt-d8b924ba1577612827210a349cbf8c6e.dll
393c:5554 @ 06186234 - LdrpMinimalMapModule - RETURN: Status: 0x00000000
393c:5554 @ 06186234 - LdrpFindKnownDll - ENTER: DLL name: msvcp140-f1c90b395d7d901af7e16ef487237278.dll
393c:5554 @ 06186234 - LdrpFindKnownDll - RETURN: Status: 0xc0000135
393c:1ea0 @ 06186234 - LdrpSearchPath - ENTER: DLL name: msvcp140-f1c90b395d7d901af7e16ef487237278.dll
393c:5554 @ 06186234 - LdrpFindKnownDll - ENTER: DLL name: vcruntime140-26a92e4fb4b73ddc824fe6616b0ea281.dll
393c:5554 @ 06186234 - LdrpFindKnownDll - RETURN: Status: 0xc0000135
393c:526c @ 06186234 - LdrpSearchPath - ENTER: DLL name: vcruntime140-26a92e4fb4b73ddc824fe6616b0ea281.dll
393c:5554 @ 06186234 - LdrpFindKnownDll - ENTER: DLL name: vcruntime140_1-23cea287d52749969d25554c25715a49.dll
393c:1ea0 @ 06186234 - LdrpComputeLazyDllPath - INFO: DLL search path computed: C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs;C:\Users\adang\source\repos\cppsandbox\x64\Release;C:\Users\adang\Downloads\venv\Lib\site-packages\meshlib.libs;C:\WINDOWS\SYSTEM32

I have no idea what would cause the DLL search path to be wrong with tbb12-58efd192759dacc003e42fc3899b5ee8.dll.
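Comparing such logs by eye is error-prone, so a small helper can pull out the computed search path and check whether a given directory is in it. The sketch below assumes the loader output format shown in the two logs above; the function names are hypothetical.

```python
MARKER = "LdrpComputeLazyDllPath - INFO: DLL search path computed: "


def computed_search_paths(log_text):
    """Return one list of directories per LdrpComputeLazyDllPath line in the log."""
    paths = []
    for line in log_text.splitlines():
        _, found, tail = line.partition(MARKER)
        if found:
            paths.append([entry for entry in tail.strip().split(";") if entry])
    return paths


def _normalize(directory):
    # Windows paths are case-insensitive; ignore a trailing backslash as well.
    return directory.rstrip("\\").lower()


def searches_directory(log_text, directory):
    """True if any computed search path in the log contains the directory."""
    wanted = _normalize(directory)
    return any(
        wanted in {_normalize(entry) for entry in entries}
        for entries in computed_search_paths(log_text)
    )
```

Applied to the two logs above with the meshlib.libs directory as the argument, searches_directory should report the directory as present for the fmt log and absent for the tbb12 log, at least for the portion of the tbb12 path that is visible before it is cut off.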

Grantim commented 1 month ago

Looks like we found the reason, here in the oneTBB changes: https://github.com/oneapi-src/oneTBB/compare/v2021.11.0...v2021.13.0#diff-01fb64c473f1e70ed71d0d37e62daa12d13f2bd487df508e550a80f8ead82f9bR38

They added /DEPENDENTLOADFLAG:0x2000, which means LOAD_LIBRARY_SAFE_CURRENT_DIRS:

If this value is used, loading a DLL for execution from the current directory is only allowed if it is under a directory in the Safe load list.

Based on this

When the operating system resolves the statically linked imports of a module, it uses the default search order. Use the /DEPENDENTLOADFLAG option to specify a load_flags value that changes the search path used to resolve these imports. On supported operating systems, it changes the static import resolution search order, similar to what LoadLibraryEx does when using LOAD_LIBRARY_SEARCH parameters. For information on the search order set by load_flags, see Search order using LOAD_LIBRARY_SEARCH flags.
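To make the flag value concrete, here is a small sketch that decodes a /DEPENDENTLOADFLAG value into the LoadLibraryEx flag names it is built from. The numeric values are the ones defined in the Windows SDK; the function name decode_load_flags is hypothetical.

```python
# LoadLibraryEx search flags, as defined in libloaderapi.h.
LOAD_FLAG_NAMES = {
    0x0100: "LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR",
    0x0200: "LOAD_LIBRARY_SEARCH_APPLICATION_DIR",
    0x0400: "LOAD_LIBRARY_SEARCH_USER_DIRS",
    0x0800: "LOAD_LIBRARY_SEARCH_SYSTEM32",
    0x1000: "LOAD_LIBRARY_SEARCH_DEFAULT_DIRS",
    0x2000: "LOAD_LIBRARY_SAFE_CURRENT_DIRS",
}


def decode_load_flags(value):
    """Return the names of the flags set in value, lowest bit first.

    Bits not listed in LOAD_FLAG_NAMES are reported as a single hex string.
    """
    names = [name for bit, name in LOAD_FLAG_NAMES.items() if value & bit]
    unknown = value & ~sum(LOAD_FLAG_NAMES)
    if unknown:
        names.append(hex(unknown))
    return names
```

With this, decode_load_flags(0x2000) yields only LOAD_LIBRARY_SAFE_CURRENT_DIRS, while the flags passed to LoadLibraryExW in the C program earlier in this thread (0x1100) decode to the two LOAD_LIBRARY_SEARCH_* names.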

adang1345 commented 1 month ago

Thanks! I've learned something new today; this is the first time I've heard of the /DEPENDENTLOADFLAG option. Let me see whether this situation can be better handled in delvewheel, either by editing tbb12.dll to override the /DEPENDENTLOADFLAG option or outputting a warning if a problematic /DEPENDENTLOADFLAG option is detected.
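One way such a detection step could look: the linker stores the /DEPENDENTLOADFLAG value in the DependentLoadFlags field of the PE load configuration directory, so it can be read back from the file. The sketch below is not delvewheel's implementation; the function name is hypothetical, and the field offsets (78 for PE32+, 54 for PE32) are taken from the IMAGE_LOAD_CONFIG_DIRECTORY layout and should be checked against the Windows SDK headers.

```python
import struct

LOAD_CONFIG_INDEX = 10  # IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG


def dependent_load_flags(data):
    """Return the DependentLoadFlags value of a PE image given as bytes (0 if absent)."""
    pe = struct.unpack_from("<I", data, 0x3C)[0]  # e_lfanew
    if data[:2] != b"MZ" or data[pe:pe + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    # COFF header: NumberOfSections at +6, SizeOfOptionalHeader at +20.
    num_sections, opt_size = struct.unpack_from("<H12xH", data, pe + 6)
    opt = pe + 24
    magic = struct.unpack_from("<H", data, opt)[0]
    if magic == 0x20B:    # PE32+ (64-bit)
        dirs, field = opt + 112, 78
    elif magic == 0x10B:  # PE32 (32-bit)
        dirs, field = opt + 96, 54
    else:
        raise ValueError("unknown optional header magic")
    if struct.unpack_from("<I", data, dirs - 4)[0] <= LOAD_CONFIG_INDEX:
        return 0          # no load config data directory entry
    rva = struct.unpack_from("<I", data, dirs + 8 * LOAD_CONFIG_INDEX)[0]
    if rva == 0:
        return 0
    sections = opt + opt_size
    for i in range(num_sections):
        vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<IIII", data, sections + 40 * i + 8)
        if vaddr <= rva < vaddr + max(vsize, raw_size):
            off = raw_ptr + (rva - vaddr)  # RVA -> file offset
            cfg_size = struct.unpack_from("<I", data, off)[0]
            if cfg_size < field + 2:
                return 0  # older, shorter structure without the field
            return struct.unpack_from("<H", data, off + field)[0]
    return 0
```

A repair tool could call this on each bundled DLL (for example with pathlib.Path(dll).read_bytes()) and warn when the result is nonzero. Whether to warn or to rewrite the field is a separate design decision that this sketch does not address.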

Grantim commented 1 month ago

Thanks!