AcademySoftwareFoundation / OpenTimelineIO

Open Source API and interchange format for editorial timeline information.
http://opentimeline.io
Apache License 2.0
1.4k stars 276 forks source link

Segmentation fault on pytest exit when creating a SerializableObjectWithMetadata #1474

Open ssteinbach opened 1 year ago

ssteinbach commented 1 year ago

A user reported that there is a segmentation fault when running pytest + objects with metadata + the "fs" (fake filesystem) capture system.

Repro:

import opentimelineio as otio

# Does not segfault if you comment out this "test"
def test_empty_fs(fs):
    pass

def test_create_clip(capsys):
    # can be child classes of SerializableObjectWithMetadata (Clip, Gap, etc),
    # however other otio C++ classes appear to be ok (RationalTime, TimeRange,
    # TimeTransform)
    otio.core.SerializableObjectWithMetadata(
        # also does not segfault if you comment this out
        metadata={}
    )

Output:

❯ pytest otio_pytest_seg_fault.py --disable-warnings
============================================== test session starts ===============================================
platform darwin -- Python 3.8.13, pytest-7.2.0, pluggy-1.0.0
rootdir: /private/var/tmp
plugins: pyfakefs-5.0.0
collected 2 items

otio_pytest_seg_fault.py ..                                                                                [100%]

========================================== 2 passed, 1 warning in 0.09s ==========================================
zsh: segmentation fault  pytest otio_pytest_seg_fault.py --disable-warnings

(The warning is related to our use of imp and appears to be a red herring)

JeanChristopheMorinPerso commented 1 year ago

When OTIO is built in debug mode, we get a little bit more details:

Fatal Python error: PyThreadState_Get: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
Python runtime state: finalizing (tstate=0x000055649b90a060)

Aborted (core dumped)
JeanChristopheMorinPerso commented 1 year ago

The fact that it crashes it metadata={} is given as an argument kind of suggests me it has to do with the round-trip in https://github.com/AcademySoftwareFoundation/OpenTimelineIO/blob/main/src/py-opentimelineio/opentimelineio-bindings/otio_utils.cpp#L99-L121. This part of the code is pretty painful to debug...

JeanChristopheMorinPerso commented 1 year ago
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7e3d6b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7ded958 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7dd753d in __GI_abort () at abort.c:79
#4  0x00007ffff7af561f in fatal_error_exit (status=-1) at Python/pylifecycle.c:2553
#5  fatal_error (fd=<optimized out>, header=header@entry=1, prefix=prefix@entry=0x7ffff7ce76d0 <__func__.6.lto_priv.2> "PyThreadState_Get", 
    msg=msg@entry=0x7ffff7c8e200 "the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)", status=status@entry=-1)
    at Python/pylifecycle.c:2664
#6  0x00007ffff7af59fe in _Py_FatalErrorFunc (func=0x7ffff7ce76d0 <__func__.6.lto_priv.2> "PyThreadState_Get", 
    msg=0x7ffff7c8e200 "the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)") at Python/pylifecycle.c:2750
#7  0x00007ffff7af5a51 in _Py_FatalError_TstateNULL (func=<optimized out>) at Python/ceval.c:271
#8  0x00007ffff7a63c30 in _Py_EnsureFuncTstateNotNULL (tstate=0x0, func=0x7ffff7ce76d0 <__func__.6.lto_priv.2> "PyThreadState_Get") at ./Include/internal/pycore_pystate.h:95
#9  PyThreadState_Get () at Python/pystate.c:1015
#10 tupledealloc (op=0x7ffff5f1be60) at Objects/tupleobject.c:272
#11 0x00007ffff7b57c1c in _Py_Dealloc (op=<optimized out>) at Objects/object.c:2300
#12 _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#13 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:567
#14 code_dealloc (co=0x7ffff5f3f9f0) at Objects/codeobject.c:658
#15 0x00007ffff7b53557 in _Py_Dealloc (op=<optimized out>) at Objects/object.c:2300
#16 _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#17 func_clear (op=0x7ffff5f1e710) at Objects/funcobject.c:626
#18 func_dealloc (op=0x7ffff5f1e710) at Objects/funcobject.c:648
#19 0x00007ffff581a6a5 in _Py_DECREF (op=0x7ffff5f1e710) at /usr/include/python3.10/object.h:500
#20 0x00007ffff581bbca in _Py_XDECREF (op=0x7ffff5f1e710) at /usr/include/python3.10/object.h:567
#21 0x00007ffff581cfcf in pybind11::handle::dec_ref() const & (this=0x7ffff5b0a130 <_value_to_any>)
    at /home/jcmorin/jcmenv/aswf/OpenTimelineIO/src/deps/pybind11/include/pybind11/pytypes.h:260
#22 0x00007ffff581d0b0 in pybind11::object::~object (this=0x7ffff5b0a130 <_value_to_any>, __in_chrg=<optimized out>)
    at /home/jcmorin/jcmenv/aswf/OpenTimelineIO/src/deps/pybind11/include/pybind11/pytypes.h:323
#23 0x00007ffff7deffa5 in __run_exit_handlers (status=status@entry=0, listp=0x7ffff7f8d760 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:113
#24 0x00007ffff7df0120 in __GI_exit (status=status@entry=0) at exit.c:143
#25 0x00007ffff7aa8699 in Py_Exit (sts=0) at Python/pylifecycle.c:2862
#26 0x00007ffff7a9f910 in handle_system_exit () at Python/pythonrun.c:775
#27 0x00007ffff7c1861c in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x55555555ed50) at Python/pythonrun.c:785
#28 PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:880
#29 0x00007ffff7aa10e7 in PyErr_Print () at Python/pythonrun.c:886
#30 _PyRun_SimpleFileObject (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffffe308) at Python/pythonrun.c:462
#31 0x00007ffff7aa1a6e in _PyRun_AnyFileObject (fp=0x55555555a470, filename=0x7ffff74424f0, closeit=1, flags=0x7fffffffe308) at Python/pythonrun.c:90
#32 0x00007ffff7c0ea6d in pymain_run_file_obj (skip_source_first_line=0, filename=0x7ffff74424f0, program_name=0x7ffff74b4810) at Modules/main.c:353
#33 pymain_run_file (config=0x5555555855a0) at Modules/main.c:372
#34 pymain_run_python (exitcode=0x7fffffffe304) at Modules/main.c:587
#35 Py_RunMain () at Modules/main.c:666
#36 0x00007ffff7be3c9b in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#37 0x00007ffff7dd8290 in __libc_start_call_main (main=main@entry=0x555555555120 <main>, argc=argc@entry=3, argv=argv@entry=0x7fffffffe538) at ../sysdeps/nptl/libc_start_call_main.h:58
#38 0x00007ffff7dd834a in __libc_start_main_impl (main=0x555555555120 <main>, argc=3, argv=0x7fffffffe538, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffe528) at ../csu/libc-start.c:381
#39 0x0000555555555045 in _start () at ../sysdeps/x86_64/start.S:115

Alright, with GDB atatched, we get this. The interesting part of that backtrace is:

#21 0x00007ffff581cfcf in pybind11::handle::dec_ref() const & (this=0x7ffff5b0a130 <_value_to_any>)
    at /home/jcmorin/jcmenv/aswf/OpenTimelineIO/src/deps/pybind11/include/pybind11/pytypes.h:260
#22 0x00007ffff581d0b0 in pybind11::object::~object (this=0x7ffff5b0a130 <_value_to_any>, __in_chrg=<optimized out>)
    at /home/jcmorin/jcmenv/aswf/OpenTimelineIO/src/deps/pybind11/include/pybind11/pytypes.h:323

Notice the _value_to_any, which confirms my suspicion, or it's at least the region of code I highlighted in my previous reply. _value_to_any is the python function that is called from the C+++ side to convert the python type given as the metadata argument into a PyANy (Python) and then returned which then converts the PyAny into a C++ PyAny (automatically by pybind11).

JeanChristopheMorinPerso commented 1 year ago

Some more information:

https://github.com/AcademySoftwareFoundation/OpenTimelineIO/blob/main/src/py-opentimelineio/opentimelineio/core/_core_utils.py#L61 is executed properly, which means that at least opentimelineio.core._core_utils._value_to_any was called and it returns.

Also found https://github.com/pytest-dev/pyfakefs/issues/693 which kind of means that pyfakefs can break stuff. Looking at the pyfakefs code, it seems extremely intrusive.

JeanChristopheMorinPerso commented 1 year ago

By doing a small change, we can verify if _value_to_any actually returns something or it it crashes while converting the return value:

diff --git a/src/py-opentimelineio/opentimelineio-bindings/otio_serializableObjects.cpp b/src/py-opentimelineio/opentimelineio-bindings/otio_serializableObjects.cpp
index b3338eb..4f37429 100644
--- a/src/py-opentimelineio/opentimelineio-bindings/otio_serializableObjects.cpp
+++ b/src/py-opentimelineio/opentimelineio-bindings/otio_serializableObjects.cpp
@@ -188,7 +188,8 @@ static void define_bases1(py::module m) {
     py::class_<SOWithMetadata, SerializableObject,
                managing_ptr<SOWithMetadata>>(m, "SerializableObjectWithMetadata", py::dynamic_attr())
         .def(py::init([](std::string name, py::object metadata) {
-                    return new SOWithMetadata(name, py_to_any_dictionary(metadata));
+                    auto asd = py_to_any_dictionary(metadata);
+                    return new SOWithMetadata(name, asd);
                 }),
             py::arg_v("name"_a = std::string()),
             py::arg_v("metadata"_a = py::none()))
diff --git a/src/py-opentimelineio/opentimelineio-bindings/otio_utils.cpp b/src/py-opentimelineio/opentimelineio-bindings/otio_utils.cpp
index c327374..55ed8ea 100644
--- a/src/py-opentimelineio/opentimelineio-bindings/otio_utils.cpp
+++ b/src/py-opentimelineio/opentimelineio-bindings/otio_utils.cpp
@@ -102,7 +102,8 @@ static void py_to_any(py::object const& o, any* result) {
         _value_to_any = core.attr("_value_to_any");
     }

-    result->swap(_value_to_any(o).cast<PyAny*>()->a);
+    py::object tmp = _value_to_any(o);
+    result->swap(tmp.cast<PyAny*>()->a);
 }

 AnyDictionary py_to_any_dictionary(py::object const& o) {

Then in GDB:

(gdb) b otio_utils.cpp:105
(gdb) b otio_utils.cpp:106
(gdb) b otio_utils.cpp:121
(gdb) b otio_serializableObjects.cpp:192
(gdb) run .venv/bin/pytest tests/test_asd.py
Starting program: /home/jcmorin/jcmenv/aswf/OpenTimelineIO/.venv/bin/python .venv/bin/pytest tests/test_asd.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
================================================================================== test session starts ==================================================================================
platform linux -- Python 3.10.8, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/jcmorin/jcmenv/aswf/OpenTimelineIO
plugins: cov-4.0.0, pyfakefs-5.0.0
collected 2 items                                                                                                                                                                       

tests/test_asd.py .
Breakpoint 1, py_to_any (o=..., result=0x7fffffff9400) at /home/jcmorin/jcmenv/aswf/OpenTimelineIO/src/py-opentimelineio/opentimelineio-bindings/otio_utils.cpp:105
105     py::object tmp = _value_to_any(o);
(gdb) l
100     if (_value_to_any.is_none()) {
101         py::object core = py::module::import("opentimelineio.core");
102         _value_to_any = core.attr("_value_to_any");
103     }
104 
105     py::object tmp = _value_to_any(o);
106     result->swap(tmp.cast<PyAny*>()->a);
107 }
108 
109 AnyDictionary py_to_any_dictionary(py::object const& o) {
(gdb) c
Continuing.

Breakpoint 2, py_to_any (o=..., result=0x7fffffff9400) at /home/jcmorin/jcmenv/aswf/OpenTimelineIO/src/py-opentimelineio/opentimelineio-bindings/otio_utils.cpp:106
106     result->swap(tmp.cast<PyAny*>()->a);
(gdb) l
101         py::object core = py::module::import("opentimelineio.core");
102         _value_to_any = core.attr("_value_to_any");
103     }
104 
105     py::object tmp = _value_to_any(o);
106     result->swap(tmp.cast<PyAny*>()->a);
107 }
108 
109 AnyDictionary py_to_any_dictionary(py::object const& o) {
110     if (o.is_none()) {
(gdb) c
Continuing.

Breakpoint 3, py_to_any_dictionary (o=...) at /home/jcmorin/jcmenv/aswf/OpenTimelineIO/src/py-opentimelineio/opentimelineio-bindings/otio_utils.cpp:121
121     return safely_cast_any_dictionary_any(a);
(gdb) l
116     if (!compare_typeids(a.type(), typeid(AnyDictionary))) {
117         throw py::type_error(string_printf("Expected an AnyDictionary (i.e. metadata); got %s instead",
118                                            type_name_for_error_message(a).c_str()));
119     }
120 
121     return safely_cast_any_dictionary_any(a);
122 }
123 
124 py::object any_to_py(any const& a, bool top_level) {
125     std::type_info const& tInfo = a.type();
(gdb) c
Continuing.

Breakpoint 4, operator() (__closure=0x5555559f63d8, name="", metadata=...) at /home/jcmorin/jcmenv/aswf/OpenTimelineIO/src/py-opentimelineio/opentimelineio-bindings/otio_serializableObjects.cpp:192
192                     return new SOWithMetadata(name, asd);
(gdb) l
187 
188     py::class_<SOWithMetadata, SerializableObject,
189                managing_ptr<SOWithMetadata>>(m, "SerializableObjectWithMetadata", py::dynamic_attr())
190         .def(py::init([](std::string name, py::object metadata) {
191                     auto asd = py_to_any_dictionary(metadata);
192                     return new SOWithMetadata(name, asd);
193                 }),
194             py::arg_v("name"_a = std::string()),
195             py::arg_v("metadata"_a = py::none()))
196         .def_property_readonly("metadata", [](SOWithMetadata* s) {

So _value_to_any returns a py::object as expected and the cast works, then safely_cast_any_dictionary_any is called and seems to be working. Then the breakpoint at line 192 is hit. So it either crashes while creating SOWithMetadata or hile returning SOWithMetadata. Or something like that.

mrbean-bremen commented 10 months ago

Stumbled over this issue today, and created a respective issue in pyfakefs. I cannot promise a fix so far (have to understand the problem first), but it seems to have something to with the caching of patched modules. Clearing the fs cache before running the problematic code can serve as an (ugly) workaround at the moment:

def test_create_clip(capsys, fs):
    fs.clear_cache()
    otio.core.SerializableObjectWithMetadata(
        metadata={}
    )
jminor commented 10 months ago

Thanks! We will test this workaround to see if it helps.

mrbean-bremen commented 10 months ago

Alternatively, you could use a fixture without caching instead of fs, e.g.:

import pytest
from pyfakefs.fake_filesystem_unittest import Patcher

@pytest.fixture
def fs_no_cache():
    with Patcher(use_cache=False) as patcher:
        yield patcher.fs

...
def test_empty_fs(fs_no_cache):
    pass

Caching is a performance feature that is only relevant if you run a larger number of tests with fs. This is the first time I see a side-effect from caching, but I had added the possibility to switch it off at the time just in case...

mrbean-bremen commented 10 months ago

Ok, I've made a patch release, please check if your tests work with the current version!

JeanChristopheMorinPerso commented 10 months ago

Nice, thanks @mrbean-bremen !