Correct C API Usage Logic for NO_GIL Multi-threading

colesbury / nogil

Multithreaded Python without the GIL

Correct C API Usage Logic for NO_GIL Multi-threading #133

Open DanielLee343 opened 11 months ago

DanielLee343 commented 11 months ago

Hi Sam, I wonder what the correct C API calling sequence is for implementing a multi-threaded feature in this no_gil CPython. I'm doing some hacking within Modules/gcmodule.c: I want to mimic gc_get_objects_impl(), but for each GC-tracked container PyObject I additionally call PyObject_GetIter() to obtain all the object references it holds. I face no problem when executing this logic between tstate = PyGILState_Ensure() and PyGILState_Release(tstate). But apparently it's holding the GIL.

If I don't hold the GIL, PyObject_GetIter() internally calls _GC_Malloc(), which segfaults at return mi_heap_calloc(tstate->heaps[mi_heap_tag_gc], nelem, elsize); since the heap structure is messed up.

Then I noticed the discussion of thread states in PEP 703. In this no_gil CPython 3.9 version, I guess the equivalent would be calling _PyThreadState_Swap() to set the thread state to ATTACHED, like this:

void *inspect_module_objs(void *arg)
{
    PyThreadState *tstate = _PyThreadState_GET();
    struct _gilstate_runtime_state *gilstate = &tstate->interp->runtime->gilstate;
    if (_PyThreadState_Swap(gilstate, tstate) != NULL)
    {
        Py_FatalError("non-NULL old thread state");
    }
    // my threading logic that I don't want to hold GIL
    // for loop (...){PyObject_GetIter(each_op) ...}
    // ...
    PyEval_ReleaseThread(tstate);
}

This inspect_module_objs() is started via PyThread_start_new_thread(inspect_module_objs, args). However, it segfaults at _PyThreadState_Swap() since tstate == NULL if you don't call PyGILState_Ensure(). If I hold the GIL before calling _PyThreadState_Swap(), it instead ends in Py_FatalError("non-NULL old thread state").

FYI, I originally asked on the Python forum, where I was told that no_gil support in mainstream 3.13 is not complete yet, so I would like to ask here. Thank you.

colesbury commented 11 months ago

Hi @DanielLee343 - It would help if you further explained what you are trying to do and your high-level motivations. For example, you say you want to mimic "gc_get_objects_impl" - why aren't you using gc.get_objects() or the other functions in the GC module? The way to see how gc.get_objects() or gc.get_referents() should be implemented is to look at their implementation. Calling PyObject_GetIter() does not sound correct, but it's hard to understand without further explanation.

Are you trying to do this in the nogil fork or in the CPython main branch (3.13 development)? As Terry wrote, the CPython main branch nogil support is still in development and not ready for testing.

_PyThreadState_Swap, _PyThreadState_Attach and other functions that begin with an underscore are private functions. You should not call them directly and instead use the public APIs.

> I face no problem when executing this logic between tstate = PyGILState_Ensure() and PyGILState_Release(tstate). But apparently it's holding the GIL.

What do you mean by "apparently it's holding the GIL"? As Terry wrote, there is no support for running without the GIL in the CPython main branch. It's still under development. In the nogil forks, it does not really hold the GIL, but the calls are still necessary. That's the whole bit about attaching and detaching. Any place in the docs that says a thread must hold the GIL, you should read as "thread must be attached", but the way you do it is the same: PyGILState_Ensure() or other functions like PyEval_RestoreThread(), depending on the context.

DanielLee343 commented 11 months ago

@colesbury Thanks for clarifying. My high-level goal is to do some statistical analysis of PyObjects in Python applications at runtime, and to use some of their semantics for research. Thus, the primary goal is to obtain all PyObjects in some manner. Previously I was using the refchain doubly linked list by enabling _PyObject_HEAD_EXTRA, but since it 1) causes extra overhead and 2) is not ABI compatible, I turned to the GC module.

Since the GC list already holds all container objects, inserted during initialization, I can loop through the GC list and, for each tracked PyObject, trace recursively until I reach PyObjects that are not iterable.

I cannot directly use the C implementation of gc.get_objects() or gc.get_referents() since neither gives me all PyObjects. get_objects() only returns container objects, and get_referents() returns only the first level of the "recursion", because of:

for (i = 0; i < PyTuple_GET_SIZE(args); i++)
{
...
    if (!_PyObject_IS_GC(obj))
        continue;
    traverse = Py_TYPE(obj)->tp_traverse;
    if (!traverse)
        continue;
...
}

For example, if a Python application defines:

>>> import random
>>> matrix_size = 5
>>> matrix_A = [[random.randint(1, 10) for _ in range(matrix_size)] for _ in range(matrix_size)]
>>> print(matrix_A)
[[9, 5, 9, 7, 7], [10, 9, 2, 5, 8], [8, 4, 2, 3, 10], [8, 3, 5, 4, 9], [8, 1, 7, 3, 10]]

I want references to all of these PyObjects, both container and non-container objects:

[[9, 5, 9, 7, 7], [10, 9, 2, 5, 8], [8, 4, 2, 3, 10], [8, 3, 5, 4, 9], [8, 1, 7, 3, 10]], 
[9, 5, 9, 7, 7], 
[10, 9, 2, 5, 8], 
[8, 4, 2, 3, 10], 
[8, 3, 5, 4, 9], 
[8, 1, 7, 3, 10], 
9, 5, 7, 10, 9, 2, 8, 4, 3, 1, 10

But gc.get_objects() gives me a lot of PyObjects created internally by the VM, and no integer objects, since they are not tracked by the GC. gc.get_referents() gives me [[2, 7, 3, 5, 8], [6, 6, 8, 1, 8], [8, 1, 5, 3, 6], [10, 2, 5, 6, 7], [5, 3, 3, 5, 8]], which is similarly insufficient for my purpose.
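The gap described here (one level of referents versus the full set of reachable objects) can be illustrated with a toy model of the tp_traverse visitor pattern. This is only a sketch under stated assumptions: ToyObj, toy_list_traverse, record_once and collect_reachable are hypothetical names, not CPython types or APIs, and the callback shapes merely imitate traverseproc and visitproc. Leaves such as integers have no traverse function, yet they are still recorded because their container's traverse function reports them.

```c
#include <stddef.h>

/* Toy stand-ins for PyObject, visitproc and traverseproc (all hypothetical). */
typedef struct ToyObj ToyObj;
typedef int (*toy_visitproc)(ToyObj *obj, void *arg);
typedef int (*toy_traverseproc)(ToyObj *self, toy_visitproc visit, void *arg);

struct ToyObj {
    toy_traverseproc traverse; /* NULL for leaves such as integers */
    ToyObj **items;            /* direct referents of a container */
    size_t n_items;
    int seen;                  /* visited mark; caller resets it before reuse */
};

/* Container traverse: report each direct referent to the visitor. */
int toy_list_traverse(ToyObj *self, toy_visitproc visit, void *arg) {
    for (size_t i = 0; i < self->n_items; i++) {
        int err = visit(self->items[i], arg);
        if (err) return err;
    }
    return 0;
}

typedef struct {
    ToyObj **out; /* caller-provided buffer, doubles as the work queue */
    size_t len;
    size_t cap;
} Worklist;

/* Visitor: record an object the first time it is seen. */
int record_once(ToyObj *obj, void *arg) {
    Worklist *w = arg;
    if (obj->seen) return 0;
    if (w->len == w->cap) return -1; /* buffer full */
    obj->seen = 1;
    w->out[w->len++] = obj;
    return 0;
}

/* Breadth-first closure over traverse; returns the count, or -1 on overflow. */
long collect_reachable(ToyObj *root, ToyObj **out, size_t cap) {
    Worklist w = { out, 0, cap };
    if (record_once(root, &w)) return -1;
    for (size_t i = 0; i < w.len; i++) {
        ToyObj *cur = w.out[i];
        if (cur->traverse == NULL) continue; /* leaf: recorded, not expanded */
        if (cur->traverse(cur, record_once, &w)) return -1;
    }
    return (long)w.len;
}
```

The visited mark keeps shared objects (for example, a small integer referenced from several rows) from being recorded twice, and the fixed output buffer means the walk itself never allocates.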

Previously, with the normal with-GIL build, I held the GIL while performing the recursion and had no problems. But the time spent holding the GIL adds too much overhead to the Python application, so I'm looking at NO_GIL. But when not holding the GIL in the NO_GIL build, some objects are deallocated by the main Python thread without my separate thread being aware of it, causing segfaults from dereferencing invalid addresses.

My current logic is added within Modules/gcmodule.c, and it's called from PyThread_start_new_thread() as a separate thread. But perhaps I should consider moving it outside. From what you are saying, it seems _PyThreadState_Attach() and similar functions are not intended to be used like this, nor are they part of the external C API; they exist only for the VM's internal thread state maintenance. Do you have any advice? Thanks.

colesbury commented 10 months ago

@DanielLee343 - you can't traverse all objects while other threads are running. The GC in nogil Python pauses other threads while it is running. If possible, you may be better off intercepting allocations and frees like some memory profilers do.

Otherwise, if you want to do this sort of analysis in nogil Python you need to:

1) Pause all threads while finding objects: https://github.com/colesbury/nogil/blob/8f9803ddf4af7e5a8c86a347ab26637f8c9ade5b/Modules/gcmodule.c#L1538-L1539 https://github.com/colesbury/nogil/blob/8f9803ddf4af7e5a8c86a347ab26637f8c9ade5b/Modules/gcmodule.c#L1601-L1603
2) Between _PyRuntimeState_StopTheWorld and _PyRuntimeState_StartTheWorld you can't call most Python APIs or you will deadlock. You can't call Py_DECREF() or PyObject_GetIter() or anything that might execute arbitrary Python code. You can call Py_INCREF(), the "raw" memory allocation functions such as PyMem_RawMalloc(), and PyTypeObject.tp_traverse, but that's about it.
3) See visit_heap for how to find objects in nogil Python. For non-GC objects, you want the same code as visit_heap but with mi_heap_tag_obj.
4) Non GC-tracked objects may not be in a reasonable state. For example, some of their fields may be used for other purposes if they're in a freelist or such. You're mostly on your own here. You may need to modify _Py_NewReference, _Py_ReattachReference, _Py_ForgetReference to differentiate between objects that are in a reasonable state and ones that are not (because they're in a freelist or something).

> But when not holding the GIL in the NO_GIL build, some objects are deallocated by the main Python thread without my separate thread being aware of it, causing segfaults from dereferencing invalid addresses.

Again, to be clear, in nogil Python you need to pause other threads (via the stop-the-world APIs), so that they do not deallocate or mutate objects that you are trying to find.

> My current logic is added within Modules/gcmodule.c, and it's called from PyThread_start_new_thread() as a separate thread. But perhaps I should consider moving it outside.

If you need to modify the runtime for your research that's fine, but the more non-standard things you want to do, the more likely you will run into issues.

DanielLee343 commented 10 months ago

@colesbury It seems I need to block other threads (either with _PyRuntimeState_StopTheWorld() in nogil or PyGILState_Ensure() in the normal build) regardless, in order to collect live object information. Just some follow-up questions here.

> See visit_heap for how to find objects in nogil Python. For non-GC objects, you want the same code as visit_heap but with mi_heap_tag_obj.

I mimicked what visit_heap() does and changed two things: 1) I replaced gc_get_objects_visitor with my own bookkeeping data structure instead of the PyList_Object, so that it doesn't disturb the VM heap's internal layout, and 2) I changed the 4 occurrences of the mi_heap_tag_gc tag to mi_heap_tag_obj to track non-GC objects. However, the number of objects I got for the mi_heap_tag_obj tag was only 99, when it should be far larger. Thus I suspect something went wrong. I cannot inspect what these 99 PyObjects are since it segfaults when trying to call Py_TYPE(op) internally.

> Non GC-tracked objects may not be in a reasonable state. For example, some of their fields may be used for other purposes if they're in a freelist or such.

What do you mean by reasonable states? This probably is the reason for the above.

I also tried to instrument _Py_NewReference and _Py_ReattachReference to proactively maintain the set of all live objects, but that seems to add too much runtime overhead. I know this is not related to nogil but more to my own work, but I would appreciate any response.

colesbury commented 10 months ago

@DanielLee343 - sorry, I forgot that visit_heap() in this fork is different from the implementation I used in later versions (like nogil-3.12) and won't work with non-GC objects. It checks the "tracked" bit to determine which objects to visit, but that only makes sense for GC objects. Non-GC objects don't have a tracked bit, so that strategy doesn't work and will probably filter out most objects and give you garbage.

You probably want to instead use mi_heap_visit_blocks called with visit_blocks=true.

That will get you most objects, but if you have multiple threads, and some of them exit, it may miss some objects. You'll also need to visit the abandoned segments. When a thread finishes without freeing all of the memory it allocated, it pushes the in-use segments (data structures containing memory blocks) to a global abandoned segment list to be later claimed by another thread. Memory there isn't "owned" by any thread and is not part of any mi_heap, but still contains live objects. You'll need to basically combine the logic of visit_segment (from gcmodule.c) with mi_heap_area_visit_blocks (from mimalloc/heap.c).
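Why a per-heap walk alone can undercount after a thread exits can be shown with a small toy model. This is a sketch only: ToySegment, ToyHeap, toy_thread_exit and toy_count_blocks are hypothetical names that do not correspond to mimalloc's real data structures or functions. It assumes each heap owns a linked list of segments and that an exiting thread moves its still-used segments onto a shared abandoned list.

```c
#include <stddef.h>

/* Toy segment: only tracks how many blocks are still in use. */
typedef struct ToySegment {
    struct ToySegment *next;
    size_t used_blocks;
} ToySegment;

/* Toy per-thread heap: a singly linked list of segments. */
typedef struct {
    ToySegment *segments;
} ToyHeap;

/* On thread exit, move segments that still hold live blocks to the
 * shared abandoned list; empty segments are simply dropped here. */
void toy_thread_exit(ToyHeap *heap, ToySegment **abandoned) {
    ToySegment *seg = heap->segments;
    heap->segments = NULL;
    while (seg != NULL) {
        ToySegment *next = seg->next;
        if (seg->used_blocks > 0) {
            seg->next = *abandoned;
            *abandoned = seg;
        }
        seg = next;
    }
}

/* Sum the used blocks along one segment list. */
size_t toy_sum_segments(const ToySegment *seg) {
    size_t total = 0;
    for (; seg != NULL; seg = seg->next) {
        total += seg->used_blocks;
    }
    return total;
}

/* Count live blocks in every heap plus the abandoned list.
 * Passing abandoned == NULL models a walk that only looks at heaps. */
size_t toy_count_blocks(const ToyHeap *heaps, size_t n_heaps,
                        const ToySegment *abandoned) {
    size_t total = toy_sum_segments(abandoned);
    for (size_t i = 0; i < n_heaps; i++) {
        total += toy_sum_segments(heaps[i].segments);
    }
    return total;
}
```

In this model, a walk that skips the abandoned list loses exactly the blocks that belonged to exited threads, which matches the kind of undercount described above.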

> What do you mean by reasonable states?

You can end up with partially destroyed objects. For example, a thread may be in the process of calling an object's tp_dealloc and have deallocated some of its pointed-to objects, but not cleared those fields. Objects tracked by the GC are guaranteed to be in a "good" state -- valid ob_type, member fields either point to valid objects or NULL, etc. But that's not necessarily true of Python objects that aren't tracked by the GC.
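The freelist case mentioned earlier can be made concrete with a toy pool allocator. This is a minimal sketch, assuming a hypothetical Block layout whose payload storage is reused as the freelist link once the block is freed; Pool, pool_alloc, pool_free and the live marker are illustrative names, not part of CPython or mimalloc. The live marker plays the role of the extra bookkeeping that hooks such as _Py_NewReference and _Py_ForgetReference would have to maintain so that a heap walker can skip blocks that are not in a reasonable state.

```c
#include <stddef.h>
#include <stdint.h>

enum { POOL_SIZE = 8 };

typedef struct Block {
    uint32_t live;               /* hypothetical marker: 1 while allocated */
    union {
        struct Block *next_free; /* meaningful only while on the freelist */
        long payload;            /* meaningful only while the block is live */
    } u;
} Block;

typedef struct {
    Block blocks[POOL_SIZE];
    Block *freelist;
} Pool;

/* Put every block on the freelist, lowest address first. */
void pool_init(Pool *p) {
    p->freelist = NULL;
    for (int i = POOL_SIZE - 1; i >= 0; i--) {
        p->blocks[i].live = 0;
        p->blocks[i].u.next_free = p->freelist;
        p->freelist = &p->blocks[i];
    }
}

/* Allocation hook: take a block, then mark it live. */
Block *pool_alloc(Pool *p, long payload) {
    Block *b = p->freelist;
    if (b == NULL) return NULL;
    p->freelist = b->u.next_free;
    b->u.payload = payload;
    b->live = 1;
    return b;
}

/* Free hook: clear the marker before the payload bytes are reused. */
void pool_free(Pool *p, Block *b) {
    b->live = 0;
    b->u.next_free = p->freelist;
    p->freelist = b;
}

/* Heap walk: visit every block, but only trust the ones marked live. */
size_t pool_count_live(const Pool *p, long *sum_out) {
    size_t n = 0;
    long sum = 0;
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!p->blocks[i].live) continue; /* payload bytes hold a pointer */
        n++;
        sum += p->blocks[i].u.payload;
    }
    if (sum_out != NULL) *sum_out = sum;
    return n;
}
```

Without the marker, the walker would interpret a freelist pointer as payload, which is the toy analogue of reading an invalid ob_type from a block that is not currently a valid object.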

DanielLee343 commented 10 months ago

Hi @colesbury, I followed your guidance and mimicked what mi_heap_visit_blocks does with visit_blocks=true, like this:

visit_blocks(...)
{
    [...]
    allocated_blocks += 1;
    // PyObject *op = (PyObject *)block;
    // Py_ssize_t cur_refcnt = Py_REFCNT(op); // works fine
    uint32_t hotness = op->hotness; // works fine
    op->hotness = 0; // seg faults
    [...]
}

It reports roughly the same number of objects as in my previous tests, but in much less time (which I'm very happy about). This visit_blocks is called from _Py_GetAllocatedBlocks_dup() in my bookkeeping thread, with the GIL held and the world stopped, as you told me:

PyGILState_STATE gstate = PyGILState_Ensure();
_PyMutex_lock(&_PyRuntime.stoptheworld_mutex);
_PyRuntimeState_StopTheWorld(&_PyRuntime); // needs gil held

_Py_GetAllocatedBlocks_dup(mainState, table);

PyGILState_Release(gstate);
_PyRuntimeState_StartTheWorld(&_PyRuntime);
_PyMutex_unlock(&_PyRuntime.stoptheworld_mutex);

And the mainState is the main PyThreadState * that I preserved before entering PyThread_start_new_thread(). I believe this is the correct logic since everything is done under stop-the-world.

But when I do Py_ssize_t cur_refcnt = Py_REFCNT(op) in visit_blocks(), it segfaults because of an invalid memory address. So my question is: is each void *block equivalent to the address of the actual PyObject *? I remember that this holds for the normal with-GIL CPython, but since nogil uses mimalloc, via something like mi_heap_malloc(tstate->heaps[mi_heap_tag_obj], nbytes);, I'm not sure how to map a block pointer to a PyObject * here. Thank you.

Edit: It seems purely reading the field of PyObject works fine, but when I write to it, the main thread segfaults. The hotness field above is what I added to the PyObject struct for bookkeeping, and it's written only within visit_blocks() within stop-the-world. Backtrace shows (main thread):

Thread 1 (Thread 0x7ffff7ea7780 (LWP 3286166) "python"):
#0  0x000055555560a435 in _Py_atomic_load_uint32_relaxed (address=0x8) at ./Include/pyatomic_gcc.h:270
#1  0x000055555560b9d7 in _Py_INCREF (op=0x0) at ./Include/object.h:508
#2  list_item_locked (self=0x2039d711190, idx=0, dead=0x0) at Objects/listobject.c:156
#3  0x000055555560bb0f in list_item_safe (self=0x2039d711190, idx=0) at Objects/listobject.c:173
#4  0x000055555560bcc6 in list_item (self=0x2039d711190, idx=0) at Objects/listobject.c:194
#5  0x000055555561463b in list_subscript (self=0x2039d711190, item=0x20399920c60) at Objects/listobject.c:3228
#6  0x000055555588f23d in PyObject_GetItem (o=0x2039d711190, key=0x20399920c60) at Objects/abstract.c:157

The crash is caused by _Py_INCREF(), which confuses me, since I don't think I've written to the ob_ref_local or ob_ref_shared fields in my bookkeeping thread. I believe mi_heap_visit_blocks doesn't do that either.