cwi-dis / cwipc

GitHub Actions Windows builds are failing #124

Open jackjansen opened 2 months ago

jackjansen commented 2 months ago

There seem to be issues with cwipc_codec_python_tests and cwipc_kinect_python_tests.

jackjansen commented 2 months ago

Temporarily disabled the failing tests on Windows, but now the resulting nightly installer exhibits the same issues.

In a way this is good, because it allows local debugging under the Visual Studio debugger.

jackjansen commented 2 months ago

Downloaded the failing installer on Beelzebub (Windows 11): works like a charm. So maybe that is why I didn't see the problem with manually-built cwipc, because I've always used Windows 11 to build?

I've also started running the individual install-check apps (from libexec/cwipc) on Topinambur (Windows 10) under the Visual Studio debugger.

cwipc_codec_check fails in the OpenCV DLL initialization routine: it's trying to do something to a mutex and gets a null pointer exception.

cwipc_realsense2_install_check fails in the realsense2 DLL initialization routine, in what appears to be the same msvcp140.dll mutex routine, mtx_do_lock().

jackjansen commented 2 months ago

Owwwww, this is very bad. The root cause appears to be an incompatible change that Microsoft has made to mutex constructors.

Found it in this thread: https://forum.juce.com/t/windows-crash-in-apvts-constructor/62039/13

Here is the Microsoft release note: https://github.com/microsoft/STL/wiki/Changelog#vs-2022-1710

Search for "Fixed mutex's constructor to be constexpr".

jackjansen commented 2 months ago

Updating MSVC Redist may do the trick: the working machine Beelzebub has 14.40.33810, the non-working Topinambur has 14.34.31931.
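
The comparison between the two machines can be written down as a small version check. This is only a sketch: the threshold of 14.40 is an assumption based on the changelog entry linked above (VS 2022 17.10), and the names parse_version, redist_is_new_enough and MIN_REDIST are made up for illustration, not part of cwipc.

```python
# Assumption: 14.40 is the first MSVC redistributable that matches the
# VS 2022 17.10 STL (the release with the constexpr mutex constructor).
MIN_REDIST = (14, 40)


def parse_version(text):
    """Turn a dotted version string such as '14.40.33810' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def redist_is_new_enough(installed, minimum=MIN_REDIST):
    """Return True if the installed redistributable is at least `minimum`."""
    # Compare only as many components as the minimum specifies.
    return parse_version(installed)[:len(minimum)] >= minimum


# The two version numbers reported in this thread.
for machine, version in [("Beelzebub", "14.40.33810"), ("Topinambur", "14.34.31931")]:
    status = "ok" if redist_is_new_enough(version) else "too old"
    print(f"{machine}: {version} -> {status}")
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "14.9" sorting after "14.40".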

jackjansen commented 2 months ago

Investigating a bit further. The GitHub Windows runner was updated to 14.40.33810 about 3 months ago.

So we should never have had the problem on the GitHub runner in the first place, only on user machines that have an older version installed.

This probably means that one of the third-party packages we install has installed a private copy of msvcp140.dll and we are accidentally picking up that one.

jackjansen commented 2 months ago

This issue seems to be related: https://github.com/actions/runner-images/issues/10055

I've added a "which msvcp140.dll" step to my Windows action, and it shows:

/c/hostedtoolcache/windows/Java_Temurin-Hotspot_jdk/8.0.422-5/x64/bin/msvcp140.dll
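
Since which only reports the first match, it can help to list every copy on the search path in order. Below is a minimal sketch of that idea; find_on_path is a hypothetical helper, not something in cwipc. Note that PATH is only one part of the Windows DLL search order, so the first hit here is not necessarily the copy a process ends up loading.

```python
import os


def find_on_path(filename, path=None):
    """Return every copy of `filename` found on `path`, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    seen = set()
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, filename)
        # Skip directories that appear on the path more than once.
        key = os.path.normcase(os.path.abspath(candidate))
        if key in seen:
            continue
        seen.add(key)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Print all copies; index 0 is the one `which` would report.
    for index, hit in enumerate(find_on_path("msvcp140.dll")):
        print(index, hit)
```

Running this as a step in the Windows action would show whether other copies are hiding behind the first one.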

jackjansen commented 2 months ago

Will attempt to apply https://github.com/OSGeo/gdal/commit/95d092d2c59961b7580add8d8736434a6c43e587 workaround.

jackjansen commented 2 months ago

No, I'm barking up the wrong tree. Or at least partially the wrong tree: I've now forcibly removed two "bad" copies of msvcp140.dll and the correct one is now foremost in $PATH, but I'm still seeing the issue.

Just realised that the problem only occurs in Python tests on the GitHub runners. And realised that something (Matplotlib, I think) includes a slurped version of msvcp140 that it tries to load early.
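
One way to check for such bundled copies is to walk the Python package directory and look for runtime DLLs shipped inside packages. This is a rough sketch, assuming the copies keep a name starting with msvcp140; find_bundled_runtime is a hypothetical helper written for this thread.

```python
import os
import sysconfig


def find_bundled_runtime(root, prefix="msvcp140"):
    """Walk `root` and return paths of DLLs whose name starts with `prefix`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lower = name.lower()  # Windows file names are case-insensitive
            if lower.startswith(prefix) and lower.endswith(".dll"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # site-packages of the running interpreter
    site_packages = sysconfig.get_paths()["purelib"]
    for path in find_bundled_runtime(site_packages):
        print(path)
```

This only finds copies on disk; it does not show which copy a running Python process has actually loaded.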

jackjansen commented 2 months ago

There is a Matplotlib issue about this: https://github.com/matplotlib/matplotlib/issues/28551

jackjansen commented 2 months ago

Removing the matplotlib file didn't work.

See https://learn.microsoft.com/en-gb/sysinternals/downloads/procdump for new ideas.

jackjansen commented 4 weeks ago

It seems that Matplotlib 3.9.2 will fix the issue (see the issue linked above). Need to check.