Fatal Python error: GC object already tracked

mw66 commented 3 years ago

I encounter this error today:

Fatal Python error: GC object already tracked

I googled a bit, seems it's related to threading, before I dive deeper, I want to confirm: PyD only run in the invoker's thread, and won't create any thread by its own?

Thanks.

ariovistus commented 3 years ago

It shouldn't create new threads

mw66 commented 3 years ago

It's kind of strange, I run my program under gdb, looks like it failed during gc_qalloc(), which is strange since the array allocated should have nothing to do with the pyd invocation in my program, i.e. here in my app logic, the stack trace has nothing related to pyd, but somehow it failed at _Py_Dealloc()!

So, somehow gc_qalloc() and pyd messed up somewhere:

(gdb) where
#0  0x00007ffff7bd5438 in _Py_Dealloc () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
#1  0x00007ffff7c00b2e in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
#2  0x00005555556fe332 in _D6deimos6python6object__T9Py_DECREFZQlUPSQBoQBkQBg8PyObjectZv (warning: (Internal error: pc 0x555555cf2530 in read in psymtab, but not in symtab.)

op=0x7fff501647c0)
    at /home//.dub/packages/pyd-0.14.0/pyd/infrastructure/deimos/python/object.d:1191
warning: (Internal error: pc 0x555555cf2530 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x555555cf2530 in read in psymtab, but not in symtab.)

#3  0x0000555555cf2531 in pyd.pydobject.PydObject.~this() (warning: (Internal error: pc 0x555555cf2530 in read in psymtab, but not in symtab.)

this=0x7fffdd2bcae0) at pydobject.d:79
#4  0x0000555555ebb487 in rt_finalize2 ()
#5  0x0000555555ed43c8 in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#6  0x0000555555ed0c16 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#7  0x0000555555ed2491 in _D2gc4impl12conservativeQw3Gcx10smallAllocMFNbmKmkxC8TypeInfoZPv ()
#8  0x0000555555ecf1a0 in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocTimelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#9  0x0000555555ed15f3 in _DThn16_2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#10 0x0000555555eb3280 in gc_qalloc ()
#11 0x0000555555eadb02 in _D4core6memory2GC6qallocFNaNbmkxC8TypeInfoZSQBqQBo8BlkInfo_ ()
#12 0x00005555558e8274 in _D3std5array__T8AppenderTAyaZQo13ensureAddableMFmZ9__lambda3MFNaNbNeZS4core6memory8BlkInfo_ ()
    at /home///ldc2-1.25.1-linux-x86_64/bin/../import/std/array.d:3409

...

mw66 commented 3 years ago

Do you think the warning:

warning: (Internal error: pc 0x555555cf2530 in read in psymtab, but not in symtab.)

tells something? gdb mess-up stack trace to point into pyd?

mw66 commented 3 years ago

Also this dtor call looks suspicious (in a different run):

#2  0x0000555555cf2531 in pyd.pydobject.PydObject.~this() (warning: (Internal error: pc 0x555555cf2530 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x555555cf2530 in read in psymtab, but not in symtab.)

this=0x7fffdcfb4400) at pydobject.d:79

mw66 commented 3 years ago

so:

~/project/ldc2-1.25.1-linux-x86_64/import/std/array.d 3409

            auto bi = (() @trusted => GC.qalloc(nbytes, blockAttribute!T))();

call into GC.qalloc, and into:

~/.dub/packages/pyd-0.14.0/pyd/infrastructure/pyd/pydobject.d 79

    /// Destructor. Calls Py_DECREF on PyObject reference.
    ~this() {
        if (m_ptr && !Py_Finalize_called) Py_DECREF(m_ptr);
        m_ptr = null;
    }

and then:

.dub/packages/pyd-0.14.0/pyd/infrastructure/deimos/python/object.d:1191

        op.ob_type.tp_dealloc(op);

which caused received signal SIGSEGV, Segmentation fault.

mw66 commented 3 years ago

I think this issue is related to gc & PyObject.dtor interaction:

  {
  py_init();
  PydObject p1 = py_eval("newA()", "moduleA");
  // some d code

  PydObject p2 = py_eval("newB()", "moduleB");
  // some d code

  PydObject p3 = py_eval("newC()", "moduleC");
  // some d code
  // seg fault here! when some D new array allocation triggered gc, 
  // and then segfault in one of the PydObject.dtor()

  p1.some_py_call(...);   // will call joblib.Parallel(n_jobs=-1, backend='multiprocessing')
  }

Then I tried to add GC.disable() at the very top before py_init(), and GC.enable() at the very bottom, that block at least run-thru.

Do you think this info can help identify the problem (probably with Py_DECREF)? or do you have any suggestion on how to further isolated the problem?

ariovistus commented 3 years ago

How are you using pyd? I am wondering if it is due to an incorrect struct in the deimos headers somewhere that is causing a bad pointer to be sent to Py_DECREF

mw66 commented 3 years ago

I'd say not so much special, just create some Python class objects, and call their methods, as I've illustrated in the previous comments code block, which I have updated again to add more details.

Maybe things worth attention:

PydObject p1, p2, p3 are from 3 different python files (modules).
the last call p1.some_py_call(...); // will call joblib.Parallel(n_jobs=-1, backend='multiprocessing')

Do you think these two are causing trouble?

Actually for (2), the segfault happens before p1.some_py_call(...) happens, so I don't think it's suspicious, also the same code worked fine before.

The new change is I added PydObject p2 = py_eval("newB()", "moduleB"); // some d code, but even this new change was working for the first few times I run it.

mw66 commented 3 years ago

encountered another SIGSEGV

the invoking line is:

  PydObject p3 = py_eval("newC()", "moduleC");

stack trace:

Thread 1 "live_trade" received signal SIGSEGV, Segmentation fault.                                                                                                                                     
0x00007ffff7adf498 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0                                                                                                                         
(gdb) up
#1  0x00007ffff7bf2b3b in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
(gdb) 
#2  0x00007ffff79fa656 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
(gdb) 
#3  0x00007ffff7ae56b4 in _PyObject_GC_Malloc () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
(gdb) 
#4  0x00007ffff7ae5967 in _PyObject_GC_New () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
(gdb) 
#5  0x00007ffff7bff641 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
(gdb) 
#6  0x00007ffff7c37dec in PyObject_GetIter () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0

mw66 commented 3 years ago

pyd.exception.PythonException@/home//.dub/packages/pyd-0.14.0/pyd/infrastructure/pyd/pydobject.d(483): 
AssertionError: 0 != True
----------------
        if (result is null) handle_exception();

the invoking line is:

  p3.some_py_method_with_no_args()();

I fixed the problem of this one, the AssertionError is on the Python side, maybe we can improve the pyd error message (better with Python filename and line number) to make this clear. I spent so much time try to debug on the D side instead.

mw66 commented 3 years ago

Just want to add more info regarding the memory issue: the Python side calls the joblib.Parallel

      arr = Parallel(n_jobs=-1, backend='multiprocessing')(...)

According to:

https://joblib.readthedocs.io/en/latest/parallel.html#working-with-numerical-data-in-shared-memory-memmapping

""" By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process. """

maybe this causing the mess-up with dlang-threading / dlang-gc / Python-gc?

mw66 commented 10 months ago

Mixing dlang-threading / dlang-gc / Python-gc is no fun. I changed my program structure to talk with Python using grpc (using C++ grpc wrapper on the client side).

ariovistus / pyd

Fatal Python error: GC object already tracked #151