AcademySoftwareFoundation / OpenColorIO

A color management framework for visual effects and animation.
https://opencolorio.org
BSD 3-Clause "New" or "Revised" License
1.79k stars 455 forks source link

Platform Latest Actions -- Python tests seg-fault on Windows #2040

Open doug-walker opened 1 month ago

doug-walker commented 1 month ago

This issue is to solve the issue that is causing the GitHub Actions Platform Latest CI runs to fail due to a seg-fault in the Python 3.11 unit tests on Windows.

remia commented 1 month ago

I believe this is the same issue that was affecting the Wheel workflow, the issue disappear when one set OCIO_PYTHON_LOAD_DLLS_FROM_PATH to 0 which seem to indicate one the path added to the Python DLL directories cause the crash. On platform latest, the crash only occurs on Release build it seems.

Through some iterative tests, I have determined that one need to explicitly exclude these two paths from being added to the DLL directories for the workflow succeed:

'C:\\Program Files\\Microsoft Service Fabric\\bin\\Fabric\\Fabric.Code'
'C:\\Program Files\\LLVM\\bin'

Dumpbin reports the following dependencies on PyOpenColorIO:

python310.dll
OpenColorIO_2_4.dll
MSVCP140.dll
VCRUNTIME140.dll
VCRUNTIME140_1.dll
api-ms-win-crt-runtime-l1-1-0.dll
api-ms-win-crt-heap-l1-1-0.dll
api-ms-win-crt-math-l1-1-0.dll
api-ms-win-crt-stdio-l1-1-0.dll
api-ms-win-crt-filesystem-l1-1-0.dll
KERNEL32.dll

When PyOpenColorIO is loaded, the following DLL are loaded (as seen from dlltracer), obviously this is on a build that's working, otherwise the segfault prevent from having any log from dlltracer (even with python -u for un-buffered outputs, which seem to indicate the crash is quite early in the import process):

LoadLibrary \Device\HarddiskVolume2\Windows\System32\kernel.appcore.dll
LoadLibrary \Device\HarddiskVolume2\hostedtoolcache\windows\Python\3.11.9\x64\DLLs\_socket.pyd
LoadLibrary \Device\HarddiskVolume2\hostedtoolcache\windows\Python\3.11.9\x64\DLLs\select.pyd
LoadLibrary \Device\HarddiskVolume3\a\OpenColorIO\OpenColorIO\_build\src\bindings\python\PyOpenColorIO\PyOpenColorIO.pyd
LoadLibrary \Device\HarddiskVolume2\Windows\System32\msvcp140.dll
LoadLibrary \Device\HarddiskVolume2\hostedtoolcache\windows\Python\3.11.9\x64\vcruntime140_1.dll
LoadLibrary \Device\HarddiskVolume3\a\OpenColorIO\OpenColorIO\_build\src\OpenColorIO\Release\OpenColorIO_2_4.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\user32.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\win32u.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\gdi32.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\gdi32full.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\msvcp_win.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\imm32.dll

Looking at those two folders mentioned above, I see two interesting things:

I wonder if there is some kind of incompatibility going on here.

I have to admit, I find the OCIO_PYTHON_LOAD_DLLS_FROM_PATH where we add a lot of directories quite strange, I don't know how common this is for other ASWF projects or in the general Python bindings landscape. It seems quite fragile.

doug-walker commented 1 month ago

Regarding OCIO_PYTHON_LOAD_DLLS_FROM_PATH, that was added in this PR. Windows users of the bindings were having trouble and so Anders Langlands had proposed that we follow what OpenImageIO had done to work around the issue. So there is at least one other ASWF project doing something similar. But I agree with you Remi that it is fragile and, if it is causing other problems, I would not object to removing it and requiring Windows users to manage their path on their own.

KelSolaar commented 1 month ago

I would not object to removing it and requiring Windows users to manage their path on their own.

I would probably recommend people doing so but maybe document that it needs to be done somewhere.

Thanks for digging this @remia, I was almost there on my end and those 2 paths are in the short list I had bisected.

remia commented 1 month ago

Maybe we could also change the default behaviour to have OCIO_PYTHON_LOAD_DLLS_FROM_PATH disabled, then, if no feedback, remove it in a future version. I also wonder if there is a clever solution where PyOpenColorIO could know some of the DLL paths needed from the compilation (in case someone doesn't have all the dependencies statically linked into OpenColorIO mostly), but this would still be fragile.

The exact cause of the crash is still unclear here, maybe something like https://github.com/mxschmitt/action-tmate could help to dig further.

KelSolaar commented 1 month ago

Those dynamic DLLs might themselves have dependencies that also need to be added. As we build everything from scratch we can track and bring all the DLLs together where required and I would recommend that people do that or statically compile although I always had problems on Windows compiling the VFX Reference Platform statically.

remia commented 1 month ago

Yeah agree it can get rather complex. The dumpbin export I pasted above is supposed to print recursively all dependents DLLs if I understood correctly by the way, but that is only for statically linking dependencies.

zachlewis commented 1 month ago

Are you guys familiar with delvewheel? I'm not. But I've been using repairwheel for MacOS and Linux wheels in a custom cibuildwheel step; ostensibly, it would work for Windows too (via delvewheel)...

KelSolaar commented 1 month ago

I think that the issue here might be that some incompatible DLLs having the same name than some OCIO depends/it dependencies depend on are added to PATH and cause it/them to break rather than missing them. I think that the Platform build here works if there is only the OCIO path for the built lib added.

KelSolaar commented 1 month ago

E.g., build with only the OCIO built lib added: https://github.com/colour-science/OpenColorIO/actions/runs/10977085917, this should left us wondering what is being tried to be loaded when the two offending paths are added? Maybe some incompatible Windows API SDK libs?

remia commented 1 month ago

Is there benefits using explicit repairwheel step over the default auditwheel / delocate_wheel steps done by cibuildwheel? As for the Windows delvewheel, it's a good suggestion, but not sure I understand correctly, is it going to pull every dependents DLL into the wheel? What about the size increase?

Edit: Ah no, I think I understand, it only pulls non-system libraries. This would not fix the issue here as some systems library are added to the DLL directories and take precedence over the one present in standard locations. We wouldn't need it for our own wheel either I think as we build everything statically (except for the OpenColorIO.dll which gets added to DLL search path manually in our init.py). But could be useful for people building their own PyOpenColorIO bindings under Windows.

zachlewis commented 1 month ago

I do know that on MacOS and Linux, auditwheel and delocate also patch the module rpath with the relative paths to the dependencies. Obviously, that doesn't happen with Windows, but... I'm not sure what does happen with Windows, to be honest.

Is there benefits using explicit repairwheel step over the default auditwheel / delocate_wheel steps done by cibuildwheel?

In OIIO's case, somehow, either auditwheel or delocate_wheel actually managed to mangle otherwise perfectly working wheels. At a point, I'd actually provided an empty string for the custom repair-wheel command, just to prevent cibuildwheel's default repair-wheels command from breaking the wheels!

In the end, what I'm really doing is using the repair-wheel-command as an opportunity to slim down + modify the wheels. It's a bit heavy-handed, but the path of least resistance right now to getting rid of everything that can be gotten rid of (like duplicate "versioned" dynamic libraries that are normally symlinked to each other). I'm blasting everything except the module and the CLI binary executables; and then I'm pointing repairwheel back to the build directory to scoop up only and exactly what the cpython modules and binaries need.

remia commented 1 month ago

Yeah I remember I had some issues as well with the various repair steps, can't remember what was the issue now (though I think building the OCIO library itself as shared instead of static was somehow helping)! As for the size, what I remember doing in OCIO is to strip the binaries / libraries produced and disable SO versioning in CMake to get rid of the symlinks.

To come back to the main discussion, I guess there is no strong objection to disabling / removing OCIO_PYTHON_LOAD_DLLS_FROM_PATH then.