AcademySoftwareFoundation / OpenColorIO

A color management framework for visual effects and animation.
https://opencolorio.org
BSD 3-Clause "New" or "Revised" License
1.74k stars 431 forks source link

python tests fail because failed GIL interaction #1939

Open negril opened 5 months ago

negril commented 5 months ago

Tested with 2.3.0 and 2.3.1 on gentoo using python-3.11 and python-3.12.

Running ctest -R test_python fails with:

test_apply (CPUProcessorTest.CPUProcessorTest.test_apply) ... pybind11::handle::inc_ref() is being called while the GIL is either not held or invalid. Please see https://pybind11.readthedocs.io/en/stable/advanced/misc.html#common-sources-of-global-interpreter-lock-errors for debugging advice.
If you are convinced there is no bug in your code, you can #define PYBIND11_NO_ASSERT_GIL_HELD_INCREF_DECREFto disable this check. In that case you have to ensure this #define is consistently used for all translation units linked into a given pybind11 extension, otherwise there will be ODR violations.The failing pybind11::handle::inc_ref() call was triggered on a numpy.ndarray object.
ERROR
test_apply_src_dst (CPUProcessorTest.CPUProcessorTest.test_apply_src_dst) ... pybind11::handle::inc_ref() is being called while the GIL is either not held or invalid. Please see https://pybind11.readthedocs.io/en/stable/advanced/misc.html#common-sources-of-global-interpreter-lock-errors for debugging advice.
If you are convinced there is no bug in your code, you can #define PYBIND11_NO_ASSERT_GIL_HELD_INCREF_DECREFto disable this check. In that case you have to ensure this #define is consistently used for all translation units linked into a given pybind11 extension, otherwise there will be ODR violations.The failing pybind11::handle::inc_ref() call was triggered on a numpy.ndarray object.
ERROR
======================================================================
ERROR: test_apply (CPUProcessorTest.CPUProcessorTest.test_apply)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/tmp/paludis/media-libs-opencolorio-2.3.1/work/OpenColorIO-2.3.1/tests/python/CPUProcessorTest.py", line 302, in test_apply
    image = OCIO.PackedImageDesc(arr, 7, 3, 3)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: pybind11::handle::inc_ref() PyGILState_Check() failure.

======================================================================
ERROR: test_apply_src_dst (CPUProcessorTest.CPUProcessorTest.test_apply_src_dst)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/tmp/paludis/media-libs-opencolorio-2.3.1/work/OpenColorIO-2.3.1/tests/python/CPUProcessorTest.py", line 331, in test_apply_src_dst
    src_image = OCIO.PackedImageDesc(src_arr, 7, 3, 3)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: pybind11::handle::inc_ref() PyGILState_Check() failure.

----------------------------------------------------------------------
meimchu commented 5 months ago

I would like to spend some time looking into this. I want to say that there isn't an issue and we could use PYBIND11_NO_ASSERT_GIL_HELD_INCREF_DECREF to disable this check. But that may not be ideal either. Fortunately the VFX Reference platform only goes up to python 3.11 for now so just to be sure, this is not a problem for that version of python, correct?

negril commented 5 months ago

Tested with 2.3.0 and 2.3.1 on gentoo using python-3.11 and python-3.12.

Python-3.11 is affected as well.

remia commented 5 months ago

It would be interesting to find out why this error isn't triggered in our CI, we test on both Python 3.11 and 3.12 when building the wheels. Maybe Python is compiled in a different way on Gentoo? Which version of Pybind11 are you using? OCIO is using a pretty old version I think at this point and newer versions are untested.

negril commented 5 months ago

Versions used are:

I'll update to python-3.11.8 and python-3.12.2 later today and retest, doubt it changes anything.

I can write a Dockerfile if that helps you reproduce it.

remia commented 5 months ago

Thanks for the information @negril, we are using Pybind11 2.9.2 so it could be interesting on our side to try with 2.11.1.

remia commented 5 months ago

Strangely enough, after explicitly using Pybind11 2.11.1 and Python 3.12 on macOS, Debug build, I can't get any of these assertions to trigger, even after adding obviously wrong acquire / release sequences in the code. Otherwise I can't see anything obviously wrong we are doing in the Pybind11 code for the time being.

Maybe a Docker image could help yeah, @meimchu did you manage to reproduce the issue on your side?

meimchu commented 5 months ago

Indeed I agree with maybe having a docker image could be helpful with debugging. I have Python 3.9.6 on my MacOS unfortunately though I did try to build it with the newer Pybind11 2.11.1 and did not encounter that issue.

meimchu commented 5 months ago

I have used pyenv to get both python 3.11.7 and python 3.12.1 on my MacOS. I also changed the build setting to use pybind 2.11.1 to build. I ran both ctest -R test_python and manually test with both versions of python on the OpenColorIOTestSuite.py. None of them appear to have return the same issues as all tests passed. Feels like a potential variable here might be the GCC version then? At least in the case of VFX reference platform, they’ve been sticking with GCC 11.2.1 for 2023/2024 for Linux distros. This mystery continues, it seems!

remia commented 3 months ago

I did hit the same issue while working on CI updates, could you try again on the main branch?