Open markshannon opened 2 years ago
This optimization sounds like a good idea to me. @markshannon, does it help if I get #27738 merged first? Integrating the trashcan mechanism (basically just a way to avoid blowing up the C stack on decref/dealloc) into the runtime feels like the correct design. The trashcan pre-dates the GC head, which is why it was done inside each type's dealloc method using the trashcan macros. There is no need to do that now. If we can eliminate the performance overhead of #27738, that seems clearly better.
Regarding flags in `tp_flags`, it would be really nice(TM) if we had flags inside PyObject somehow. One crazy idea: allocate two words in front of the PyObject for all objects, not just GC objects. Maybe the memory overhead is too large? If we could eventually move extensions over to using `Py_TYPE` and the refcnt macros/functions, we could move the type and refcount into there. Also, you would have to force all PyObject allocations to go through the CPython object allocator. I'm not sure how many extensions actually do that but there must be at least some.
> Regarding flags in `tp_flags`, it would be really nice(TM) if we had flags inside PyObject somehow.

The current PEP 683 (immortal objects) implementation uses 32-bit saturated refcounts, leaving us an opportunity with most of the remaining bits. Mark's been salivating at that for a while, including for use as per-object flags. 😄
(The PEP has been submitted to the steering council.)
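
To make the "remaining bits" idea concrete, here is a minimal sketch in C. It assumes a 64-bit refcount word whose low 32 bits hold a saturating count, with the high 32 bits available for per-object flags. All `toy_` and `TOY_` names, including the example flag, are hypothetical and are not CPython's actual layout or API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 32 bits are a saturating refcount, high 32
   bits are per-object flags. This is an assumption for illustration. */
#define TOY_REFCNT_MASK  UINT64_C(0xFFFFFFFF)
#define TOY_FLAG_EXAMPLE (UINT64_C(1) << 32)  /* made-up flag bit */

/* Extract the refcount from the combined word. */
uint32_t toy_refcnt(uint64_t word) {
    return (uint32_t)(word & TOY_REFCNT_MASK);
}

/* Increment, unless the count is saturated (treated as immortal). */
uint64_t toy_incref(uint64_t word) {
    if ((word & TOY_REFCNT_MASK) == TOY_REFCNT_MASK) {
        return word;
    }
    return word + 1;  /* low half is below its maximum, so no carry into flags */
}

/* Decrement, unless the count is saturated (treated as immortal). */
uint64_t toy_decref(uint64_t word) {
    if ((word & TOY_REFCNT_MASK) == TOY_REFCNT_MASK) {
        return word;
    }
    assert(toy_refcnt(word) > 0);
    return word - 1;  /* low half is nonzero, so no borrow from flags */
}

/* Flag helpers operate only on the high bits. */
uint64_t toy_set_flag(uint64_t word, uint64_t flag) { return word | flag; }
int toy_has_flag(uint64_t word, uint64_t flag) { return (word & flag) != 0; }
```

Because the count saturates instead of overflowing, refcount arithmetic never disturbs the flag bits in this sketch.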
This is in part motivated by https://github.com/faster-cpython/ideas/discussions/402. It is also an attempt to avoid the inefficiencies in https://github.com/python/cpython/pull/27738. It also relates to https://github.com/faster-cpython/ideas/discussions/132. It is also needed to implement https://github.com/python/cpython/issues/98260 efficiently.
Almost all objects end up on a freelist when de-allocated, about half in an explicit freelist, and the other half in an `ob_malloc` freelist. However, the amount of indirection and overhead to get from `_Py_Dealloc` to adding something to the freelist can be huge. To free an int the following happens:

1. `_Py_Dealloc` calls `PyLong_Type.tp_dealloc` (via a function pointer, just to prevent the compiler doing its job :disappointed: )
2. `PyLong_Type.tp_dealloc` calls `PyObject_Free` (again via function pointer)
3. `PyObject_Free` calls `_PyObject_Free` (again via function pointer)
4. `_PyObject_Free` calls `pymalloc_free` which:
   - `ob_malloc`
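
The call chain described above can be modelled with a small, self-contained C sketch. It is a toy, not CPython source: every `toy_` name is hypothetical and only mimics the role of the function named in its comment. It counts how many calls go through a function pointer before any memory is actually released.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: these are not CPython's real definitions. */
typedef struct toy_type toy_type;
typedef struct toy_object { long refcnt; toy_type *type; } toy_object;

struct toy_type {
    void (*tp_dealloc)(toy_object *);  /* role of tp_dealloc */
    void (*tp_free)(void *);           /* role of tp_free */
};

static int indirect_calls;  /* calls made through a function pointer */

/* Role of pymalloc_free: the only step that does real work. */
static void toy_pool_free(void *p) { free(p); }

/* Role of _PyObject_Free: a direct call into the pool allocator. */
static void toy_raw_free(void *ctx, void *p) { (void)ctx; toy_pool_free(p); }

/* Allocator table consulted by the public free function. */
static struct { void *ctx; void (*free)(void *, void *); } toy_allocator =
    { NULL, toy_raw_free };

/* Role of PyObject_Free: forwards through the allocator table. */
static void toy_object_free(void *p) {
    indirect_calls++;
    toy_allocator.free(toy_allocator.ctx, p);
}

/* Role of the int type's tp_dealloc: forwards through tp_free. */
static void toy_long_dealloc(toy_object *op) {
    indirect_calls++;
    op->type->tp_free(op);
}

static toy_type toy_long_type = { toy_long_dealloc, toy_object_free };

/* Role of _Py_Dealloc: forwards through tp_dealloc. */
static void toy_dealloc(toy_object *op) {
    indirect_calls++;
    op->type->tp_dealloc(op);
}

/* Allocate a toy int, drop its last reference, and report how many
   indirect calls were needed to release it. */
int free_one_toy_int(void) {
    toy_object *op = malloc(sizeof *op);
    assert(op != NULL);
    op->refcnt = 1;
    op->type = &toy_long_type;
    indirect_calls = 0;
    if (--op->refcnt == 0) {
        toy_dealloc(op);
    }
    return indirect_calls;
}
```

In this model `free_one_toy_int()` makes three indirect calls for an object that needs no cleanup beyond releasing its memory.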
We want to do two things to improve performance:

1. Get from `Py_DECREF()` to `PyObject_Free` more efficiently.
2. Get from `PyObject_Free` to putting the memory on the freelist more efficiently.

Getting from `Py_DECREF()` to `PyObject_Free` more efficiently

Rather than every extension class writing its own dealloc and free functions, types should set flags to indicate whether they:
- Have a `tp_dealloc` function that can do anything.

We need two bits in `tp_flags` to express this.

For objects that are just lumps of memory we can set `tp_dealloc` to point to `PyObject_Free`, avoiding the extra indirection. The other cases would get their own function pointers, but could do some of the dispatching at class creation time, not at object deallocation time.

Getting from `PyObject_Free` to putting the memory on the freelist more efficiently

See https://github.com/faster-cpython/ideas/discussions/132 for implementation details of freelists.
We need to compute the size of the object quickly to determine the freelist to use. Any class that uses the standard allocator `PyType_GenericAlloc` can have its size computed reliably. Other classes would need to use the current generic approach, possibly with a few customizations.
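
As a rough illustration of both halves of the proposal, the sketch below picks the dealloc function once, at type creation, from two flag bits, and computes a freelist index directly from the type's sizes. Everything here is an assumption made for illustration: the `toy_`/`TOY_` names, the flag values, the 16-byte size classes, and the freelist layout are hypothetical and are not taken from CPython or from the linked discussions. Only the two cases the text spells out (plain memory, fully custom dealloc) are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical two-bit encoding; only two of the four values are named. */
#define TOY_DEALLOC_MASK   3u
#define TOY_DEALLOC_PLAIN  0u  /* object is just a lump of memory */
#define TOY_DEALLOC_CUSTOM 3u  /* custom dealloc that can do anything */

#define TOY_ALIGNMENT     16   /* assumed size-class granularity */
#define TOY_NUM_FREELISTS 32   /* assumed number of size classes */

typedef struct toy_type toy_type;
typedef struct toy_object { long refcnt; toy_type *type; size_t nitems; } toy_object;

struct toy_type {
    unsigned flags;
    size_t basicsize;                      /* must be >= sizeof(toy_object) */
    size_t itemsize;
    void (*custom_dealloc)(toy_object *);  /* user supplied, may be NULL */
    void (*dealloc)(toy_object *);         /* chosen once, at type creation */
};

static void *toy_freelists[TOY_NUM_FREELISTS];

/* Size class from the type alone: no allocator lookup is needed. */
size_t toy_size_class(const toy_type *tp, size_t nitems) {
    size_t size = tp->basicsize + nitems * tp->itemsize;
    assert(size > 0);
    return (size + TOY_ALIGNMENT - 1) / TOY_ALIGNMENT - 1;
}

/* Dealloc for plain objects: push the memory straight onto its freelist. */
static void toy_plain_dealloc(toy_object *op) {
    size_t cls = toy_size_class(op->type, op->nitems);
    if (cls >= TOY_NUM_FREELISTS) { free(op); return; }
    *(void **)op = toy_freelists[cls];  /* reuse the dead object as a link */
    toy_freelists[cls] = op;
}

/* Dispatch happens here, at class creation, not at every deallocation. */
void toy_type_ready(toy_type *tp) {
    if ((tp->flags & TOY_DEALLOC_MASK) == TOY_DEALLOC_PLAIN) {
        tp->dealloc = toy_plain_dealloc;
    } else {
        assert(tp->custom_dealloc != NULL);
        tp->dealloc = tp->custom_dealloc;
    }
}

/* Allocate from the matching freelist when possible, else from malloc. */
toy_object *toy_alloc(toy_type *tp, size_t nitems) {
    size_t cls = toy_size_class(tp, nitems);
    toy_object *op;
    if (cls < TOY_NUM_FREELISTS && toy_freelists[cls] != NULL) {
        op = toy_freelists[cls];
        toy_freelists[cls] = *(void **)op;
    } else {
        op = malloc((cls + 1) * TOY_ALIGNMENT);
        assert(op != NULL);
    }
    op->refcnt = 1;
    op->type = tp;
    op->nitems = nitems;
    return op;
}

/* Drop one reference; the last one calls the pre-selected dealloc. */
void toy_decref(toy_object *op) {
    if (--op->refcnt == 0) {
        op->type->dealloc(op);
    }
}
```

With this shape, a plain object goes from `toy_decref` to its freelist through a single function pointer, and a later `toy_alloc` of the same size class reuses that memory.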