adang1345 / delvewheel

Self-contained Python wheels for Windows
MIT License
116 stars 12 forks source link

Strange behaviour with Matplotlib wheels dependent on delvewheel version #52

Closed ianthomas23 closed 1 month ago

ianthomas23 commented 1 month ago

Matplotlib is experiencing some strange behaviour (segmentation faults) in wheels using some but not all versions of delvewheel, and we'd appreciate some help understanding what is going on and hopefully fixing it.

The Matplotlib 3.9.1 release was yanked because although Matplotlib-only tests worked fine, using it on Windows with some downstream libraries was causing segmentation faults. I believe the original issue is https://github.com/actions/runner-images/issues/10055, a change in use of C++ std::mutex by MSVC that can be worked around with a #define, but the shipping and use of MSVC DLLs (which I don't think any of us on Matplotlib fully understand) is not letting the problem go away.

The related Matplotlib issue is quite long, the most relevant comment is https://github.com/matplotlib/matplotlib/issues/28551#issuecomment-2273144251.

I think the original Matplotlib 3.9.1 used delvewheel 1.7.1. There was a period of about a week when the problem was magically solved for us which I think corresponds to delvewheel 1.7.2. Based on that we made a 3.9.1.post1 release, but unfortunately that used delvewheel 1.7.3 and we are back to experiencing some segmentation faults although only in a subset of situations that were problematic before.

The wheels we were happy with are at https://github.com/matplotlib/matplotlib/actions/runs/10227059586/job/28298087600, and wheels that are now problematic are at https://github.com/matplotlib/matplotlib/actions/runs/10272282149/job/28424163560. I have made a simple github repository to reproduce the problem at https://github.com/ianthomas23/mpl-test. The workflow to reproduce the failure is, from https://github.com/ianthomas23/mpl-test/blob/cc04a4efbaa68a7aeda35303de6ad74792b2c6f7/.github/workflows/test.yml#L119 this:

<create new python environment>
python -m pip install -v --no-binary=contourpy contourpy
python -m pip install --only-binary=:all: --pre --upgrade --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple matplotlib
python test_with.py

where test_with.py is

# Test with import of matplotlib

from contourpy import contour_generator
import matplotlib.pyplot as plt

print("START test_with.py")
cont_gen = contour_generator(z=[[0, 1], [2, 3]])
try:
    cont_gen.filled(2.0, 1.0)
except Exception as e:
    print("EXCEPTION HANDLER", e)
print("END test_with.py")

One can download the good and bad wheels from the above links and pip install those to confirm good/bad behaviour.

adang1345 commented 1 month ago

I took a look at the good and bad wheels and reproduced the problem locally. I have a hypothesis as to what's causing the problem. First, let me explain the relevant difference in delvewheel versions. In delvewheel 1.7.2, I added code that avoids renaming msvcp140.dll when it's vendored into the wheel, intending to fix a bug that was reported. Soon afterward, I discovered that the bug had an unrelated cause and reverted that change in delvewheel 1.7.3.

The good wheel contains msvcp140.dll version 14.16.27033.0. When test_with.py is run, this DLL doesn't actually get loaded because C:\Windows\System32\msvcp140.dll is earlier in the search path if it exists. On my machine, C:\Windows\System32\msvcp140.dll is version 14.40.33810.0, which is newer.

The bad wheel contains msvcp140-cb1364a8f14ec1d3687d6faef0fd327e.dll version 14.16.27033.0. When test_with.py is run, this DLL does get loaded because there is very unlikely to be another DLL on the system with the same name. This is where the crash is happening.

I don't think the renaming in itself is responsible for the crash. Rather, I think the crash is due to matplotlib bundling a too-old version of msvcp140.dll. Version 14.16.27033.0 was distributed with Visual Studio 2017, so if matplotlib is built with a newer Visual Studio version, then you run the risk of compatibility issues. I would suggest that you try bundling a newer version and seeing whether that makes a difference. delvewheel uses the PATH environment variable to find DLLs to bundle. Figure out where a newer version of msvcp140.dll exists on your build machine and prepend it to PATH at the time you call delvewheel, or use the --add-path command-line option to delvewheel.

ksunden commented 1 month ago

The version we ship is I believe simply the version that is on the Github runners.

The thing is we did not have a problem with mpl 3.9.0, only 3.9.1, despite them having the exact same version of msvc bundled.

I do think we had exceptionally poor timing with regards to the changes that happened here that may have muddled our understanding:

So really poor timing on both ends: we eliminated delvewheel just prior to a change that actually "fixed" it, and then we released right after the version that rebroke it.

So I guess my question to you is:

adang1345 commented 1 month ago

I'm not sure whether name mangling is related to the issue here. Maybe it has something to do with it, but I can't say for certain. I would suggest that you try two things. The first is, as I indicated, is to bundle a newer version of msvcp140.dll. I know that the test passed with matplotlib 3.9.0 with the same version, but I still think it's worth a try.

I checked the PATH value on the GitHub Actions Windows Server 2022 runner, and I see msvcp140.dll in the following locations, ordered based on PATH.

C:\hostedtoolcache\windows\Java_Temurin-Hotspot_jdk\8.0.422-5\x64\bin\msvcp140.dll
C:\Program Files\ImageMagick-7.1.1-Q16-HDRI\msvcp140.dll
C:\Windows\system32\msvcp140.dll
C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code\msvcp140.dll
C:\Program Files\LLVM\bin\msvcp140.dll

By default, delvewheel will pick up C:\hostedtoolcache\windows\Java_Temurin-Hotspot_jdk\8.0.422-5\x64\bin\msvcp140.dll because that is the first that is found in PATH, even though that's probably not the one you built against.

The second thing you can try is to add the flag --no-mangle msvcp140.dll to the delvewheel invocation. This will disable the name mangling.

ianthomas23 commented 1 month ago

@adang1345 Thanks for the explanation, it is really helpful. I am trying out delvewheel repair --add-path C:\\Windows\\system32 in matplotlib/matplotlib#28679 and it is looking promising so far.

QuLogic commented 1 month ago

Given this change in Python 3.8 (emphasis mine):

DLL dependencies for extension modules and DLLs loaded with ctypes on Windows are now resolved more securely. Only the system paths, the directory containing the DLL or PYD file, and directories added with add_dll_directory() are searched for load-time dependencies. Specifically, PATH and the current working directory are no longer used, and modifications to these will no longer have any effect on normal DLL resolution. If your application relies on these mechanisms, you should check for add_dll_directory() and if it exists, use it to add your DLLs directory while loading your library. Note that Windows 7 users will need to ensure that Windows Update KB2533623 has been installed (this is also verified by the installer).

I wonder if delvewheel should even be looking at PATH by default at all?

ianthomas23 commented 1 month ago

I am closing this as resolved. Passing --add-path did work, but now we are going with a different solution (matplotlib/matplotlib#28687) after receiving advice on exactly what linker flags to use from a colleague at Microsoft.

adang1345 commented 1 month ago

To help diagnose similar issues in the future, https://github.com/adang1345/delvewheel/commit/1f996d4b4e74dcb3af24d18440efa1ae5108df98 adds a warning if the bundled version of a Microsoft Visual C++ DLL is older than the version that the application is linked against.