Closed ianthomas23 closed 1 month ago
I took a look at the good and bad wheels and reproduced the problem locally. I have a hypothesis as to what's causing the problem. First, let me explain the relevant difference in delvewheel versions. In delvewheel 1.7.2, I added code that avoids renaming msvcp140.dll
when it's vendored into the wheel, intending to fix a bug that was reported. Soon afterward, I discovered that the bug had an unrelated cause and reverted that change in delvewheel 1.7.3.
The good wheel contains msvcp140.dll
version 14.16.27033.0. When test_with.py
is run, this DLL doesn't actually get loaded because C:\Windows\System32\msvcp140.dll
is earlier in the search path if it exists. On my machine, C:\Windows\System32\msvcp140.dll
is version 14.40.33810.0, which is newer.
The bad wheel contains msvcp140-cb1364a8f14ec1d3687d6faef0fd327e.dll
version 14.16.27033.0. When test_with.py
is run, this DLL does get loaded because there is very unlikely to be another DLL on the system with the same name. This is where the crash is happening.
I don't think the renaming in itself is responsible for the crash. Rather, I think the crash is due to matplotlib bundling a too-old version of msvcp140.dll
. Version 14.16.27033.0 was distributed with Visual Studio 2017, so if matplotlib is built with a newer Visual Studio version, then you run the risk of compatibility issues. I would suggest that you try bundling a newer version and seeing whether that makes a difference. delvewheel uses the PATH
environment variable to find DLLs to bundle. Figure out where a newer version of msvcp140.dll
exists on your build machine and prepend it to PATH
at the time you call delvewheel, or use the --add-path
command-line option to delvewheel.
The version we ship is I believe simply the version that is on the Github runners.
The thing is we did not have a problem with mpl 3.9.0, only 3.9.1, despite them having the exact same version of msvc bundled.
I do think we had exceptionally poor timing with regards to the changes that happened here that may have muddled our understanding:
post0
release, which passed our testing, but due to problems with setuptools-scm
, we needed to retag..post1
, which is now also broken in the same way as before.So really poor timing on both ends: we eliminated delvewheel just prior to a change that actually "fixed" it, and then we released right after the version that rebroke it.
So I guess my question to you is:
I'm not sure whether name mangling is related to the issue here. Maybe it has something to do with it, but I can't say for certain. I would suggest that you try two things. The first is, as I indicated, is to bundle a newer version of msvcp140.dll
. I know that the test passed with matplotlib 3.9.0 with the same version, but I still think it's worth a try.
I checked the PATH
value on the GitHub Actions Windows Server 2022 runner, and I see msvcp140.dll
in the following locations, ordered based on PATH
.
C:\hostedtoolcache\windows\Java_Temurin-Hotspot_jdk\8.0.422-5\x64\bin\msvcp140.dll
C:\Program Files\ImageMagick-7.1.1-Q16-HDRI\msvcp140.dll
C:\Windows\system32\msvcp140.dll
C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code\msvcp140.dll
C:\Program Files\LLVM\bin\msvcp140.dll
By default, delvewheel
will pick up C:\hostedtoolcache\windows\Java_Temurin-Hotspot_jdk\8.0.422-5\x64\bin\msvcp140.dll
because that is the first that is found in PATH
, even though that's probably not the one you built against.
The second thing you can try is to add the flag --no-mangle msvcp140.dll
to the delvewheel
invocation. This will disable the name mangling.
@adang1345 Thanks for the explanation, it is really helpful. I am trying out delvewheel repair --add-path C:\\Windows\\system32
in matplotlib/matplotlib#28679 and it is looking promising so far.
Given this change in Python 3.8 (emphasis mine):
DLL dependencies for extension modules and DLLs loaded with ctypes on Windows are now resolved more securely. Only the system paths, the directory containing the DLL or PYD file, and directories added with add_dll_directory() are searched for load-time dependencies. Specifically, PATH and the current working directory are no longer used, and modifications to these will no longer have any effect on normal DLL resolution. If your application relies on these mechanisms, you should check for add_dll_directory() and if it exists, use it to add your DLLs directory while loading your library. Note that Windows 7 users will need to ensure that Windows Update KB2533623 has been installed (this is also verified by the installer).
I wonder if delvewheel
should even be looking at PATH
by default at all?
I am closing this as resolved. Passing --add-path
did work, but now we are going with a different solution (matplotlib/matplotlib#28687) after receiving advice on exactly what linker flags to use from a colleague at Microsoft.
To help diagnose similar issues in the future, https://github.com/adang1345/delvewheel/commit/1f996d4b4e74dcb3af24d18440efa1ae5108df98 adds a warning if the bundled version of a Microsoft Visual C++ DLL is older than the version that the application is linked against.
Matplotlib is experiencing some strange behaviour (segmentation faults) in wheels using some but not all versions of delvewheel, and we'd appreciate some help understanding what is going on and hopefully fixing it.
The Matplotlib 3.9.1 release was yanked because although Matplotlib-only tests worked fine, using it on Windows with some downstream libraries was causing segmentation faults. I believe the original issue is https://github.com/actions/runner-images/issues/10055, a change in use of C++
std::mutex
by MSVC that can be worked around with a#define
, but the shipping and use of MSVC DLLs (which I don't think any of us on Matplotlib fully understand) is not letting the problem go away.The related Matplotlib issue is quite long, the most relevant comment is https://github.com/matplotlib/matplotlib/issues/28551#issuecomment-2273144251.
I think the original Matplotlib 3.9.1 used delvewheel 1.7.1. There was a period of about a week when the problem was magically solved for us which I think corresponds to delvewheel 1.7.2. Based on that we made a 3.9.1.post1 release, but unfortunately that used delvewheel 1.7.3 and we are back to experiencing some segmentation faults although only in a subset of situations that were problematic before.
The wheels we were happy with are at https://github.com/matplotlib/matplotlib/actions/runs/10227059586/job/28298087600, and wheels that are now problematic are at https://github.com/matplotlib/matplotlib/actions/runs/10272282149/job/28424163560. I have made a simple github repository to reproduce the problem at https://github.com/ianthomas23/mpl-test. The workflow to reproduce the failure is, from https://github.com/ianthomas23/mpl-test/blob/cc04a4efbaa68a7aeda35303de6ad74792b2c6f7/.github/workflows/test.yml#L119 this:
where
test_with.py
isOne can download the good and bad wheels from the above links and
pip install
those to confirm good/bad behaviour.