faster-cpython / ideas


More efficient deallocation. #482

Open markshannon opened 2 years ago

markshannon commented 2 years ago

This is in part motivated by https://github.com/faster-cpython/ideas/discussions/402. It is also an attempt to avoid the inefficiencies in https://github.com/python/cpython/pull/27738. It also relates to https://github.com/faster-cpython/ideas/discussions/132, and it is needed to implement https://github.com/python/cpython/issues/98260 efficiently.

Almost all objects end up on a freelist when de-allocated, about half in an explicit freelist, and the other half in an ob_malloc freelist. However, the amount of indirection and overhead to get from _Py_Dealloc to adding something to the freelist can be huge. To free an int the following happens:

We want to do two things to improve performance.

  1. Get from Py_DECREF() to PyObject_Free more efficiently
  2. Get from PyObject_Free to putting the memory on the freelist more efficiently.

Getting from Py_DECREF() to PyObject_Free more efficiently

Rather than every extension class writing its own dealloc and free functions, types should set flags to indicate whether they:

We need two bits in tp_flags to express this.

For objects that are just lumps of memory, we can set tp_dealloc to point to PyObject_Free, avoiding the extra indirection. The other cases would get their own function pointers, but we can do some of the dispatching at class creation time, not at object deallocation time.
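A rough sketch of what "dispatch at class creation time" could look like, written as a standalone toy model rather than CPython code. Every `toy_*` name is made up for illustration, and the meanings given to the two flag bits are assumptions, not the categories proposed in this issue:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: none of these names exist in CPython. */
typedef struct toy_object toy_object;
typedef void (*toy_destructor)(toy_object *);

enum {
    TOY_FLAG_HOLDS_REFS = 1u << 0,  /* assumed meaning: owns a reference to another object */
    TOY_FLAG_FINALIZE   = 1u << 1   /* assumed meaning: a finalizer must run before freeing */
};

typedef struct {
    unsigned flags;                  /* plays the role of two bits in tp_flags */
    toy_destructor dealloc;          /* plays the role of tp_dealloc */
    void (*finalize)(toy_object *);  /* only used when TOY_FLAG_FINALIZE is set */
} toy_type;

struct toy_object {
    long refcnt;
    toy_type *type;
    toy_object *child;  /* only meaningful when TOY_FLAG_HOLDS_REFS is set */
};

static int toy_free_calls = 0;  /* counts frees so the behaviour can be observed */

static void toy_decref(toy_object *op);

/* The "lump of memory" case: the destructor is just the free function. */
static void toy_free(toy_object *op) { toy_free_calls++; free(op); }

static void dealloc_refs(toy_object *op) {
    if (op->child != NULL) toy_decref(op->child);
    toy_free(op);
}

static void dealloc_finalize(toy_object *op) {
    op->type->finalize(op);
    toy_free(op);
}

static void dealloc_finalize_refs(toy_object *op) {
    op->type->finalize(op);
    dealloc_refs(op);
}

/* Runs once per type: the flag bits pick the destructor here,
 * so no flag tests are needed when an object is deallocated. */
static void toy_type_ready(toy_type *tp) {
    static const toy_destructor table[4] = {
        toy_free, dealloc_refs, dealloc_finalize, dealloc_finalize_refs
    };
    assert(!(tp->flags & TOY_FLAG_FINALIZE) || tp->finalize != NULL);
    tp->dealloc = table[tp->flags & 3u];
}

static void toy_decref(toy_object *op) {
    assert(op->refcnt > 0);
    if (--op->refcnt == 0) op->type->dealloc(op);  /* one indirect call */
}

static toy_object *toy_new(toy_type *tp, toy_object *child) {
    toy_object *op = malloc(sizeof *op);
    if (op == NULL) abort();
    op->refcnt = 1;
    op->type = tp;
    op->child = child;
    return op;
}
```

With this layout, a type with no flags set ends up with `dealloc == toy_free`, so dropping the last reference goes straight to the free function with a single indirect call.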

Getting from PyObject_Free to putting the memory on the freelist more efficiently.

See https://github.com/faster-cpython/ideas/discussions/132 for implementation details of freelists.

We need to compute the size of the object quickly to determine the freelist to use. Any class that uses the standard allocator PyType_GenericAlloc can have its size computed reliably. Other classes would need to use the current generic approach, possibly with a few customizations.
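To make the size computation concrete, here is a minimal sketch of mapping an object's size to a freelist index. It is a standalone toy model: the `sized_type` struct, the 16-byte granularity, and the 32 size classes are all assumptions chosen for illustration, and the real allocator accounts for details (such as the GC header) that this ignores:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ALIGNMENT   16u  /* assumed size-class granularity in bytes */
#define TOY_NUM_CLASSES 32u  /* assumed: freelists cover sizes up to 512 bytes */

/* Hypothetical stand-in for the two size fields of a type object. */
typedef struct {
    size_t basicsize;  /* plays the role of tp_basicsize */
    size_t itemsize;   /* plays the role of tp_itemsize; 0 for fixed-size types */
} sized_type;

/* Size in bytes of an instance with nitems variable-length items. */
static size_t toy_object_size(const sized_type *tp, size_t nitems) {
    return tp->basicsize + nitems * tp->itemsize;
}

/* Index of the freelist serving nbytes, or -1 if the request is
 * empty or too large and must take the generic path instead. */
static int toy_size_class(size_t nbytes) {
    if (nbytes == 0 || nbytes > TOY_ALIGNMENT * TOY_NUM_CLASSES) {
        return -1;
    }
    return (int)((nbytes - 1) / TOY_ALIGNMENT);
}
```

For a type whose size is known from these two fields, finding the freelist is one multiply-add and one division by a constant, with no need to ask the allocator which pool the pointer came from.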

nascheme commented 2 years ago

This optimization sounds like a good idea to me. @markshannon, does it help if I get #27738 merged first? Integrating the trashcan mechanism (basically just a way to avoid blowing up the C stack on decref/dealloc) into the runtime feels like the correct design. The trashcan pre-dates the GC head and so that's why it was done inside each type's dealloc method using the trashcan macros. There is no need to do that now. If we can eliminate the performance overhead of #27738, that seems clearly better.

Regarding flags in tp_flags, it would be really nice(TM) if we had flags inside PyObject somehow. One crazy idea: allocate two words in front of the PyObject for all objects, not just GC objects. Maybe the memory overhead is too large? If we could eventually move extensions over to using Py_TYPE and the refcnt macros/functions, we could move the type and refcount into there. Also, you would have to force all PyObject allocations to go through the CPython object allocator. I'm not sure how many extensions actually do that but there must be at least some.
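To illustrate the "two words in front" idea, a minimal standalone sketch follows. It is a toy model, not CPython code: `toy_prefix`, `toy_object`, and the helper functions are hypothetical names, and the contents of the two prefix words are an assumption:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Two machine words stored immediately before every object (hypothetical layout). */
typedef struct {
    uintptr_t flags;
    uintptr_t reserved;
} toy_prefix;

/* Stand-in for an object header; not the real PyObject. */
typedef struct {
    long refcnt;
    void *type;
} toy_object;

/* Allocate size bytes for the object plus the hidden prefix.
 * The caller only ever sees the pointer just past the prefix. */
static toy_object *toy_alloc(size_t size) {
    toy_prefix *pre = calloc(1, sizeof(toy_prefix) + size);
    if (pre == NULL) return NULL;
    return (toy_object *)(pre + 1);
}

/* Step back from the object pointer to its prefix. This is only valid
 * for objects that came from toy_alloc, which is why every allocation
 * would have to go through the same allocator. */
static toy_prefix *toy_prefix_of(toy_object *op) {
    return (toy_prefix *)op - 1;
}

static void toy_release(toy_object *op) {
    free(toy_prefix_of(op));
}
```

Calling `toy_prefix_of` on an object allocated any other way would read memory that does not belong to it, which is the constraint on extensions mentioned above.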

ericsnowcurrently commented 2 years ago

> Regarding flags in tp_flags, it would be really nice(TM) if we had flags inside PyObject somehow.

The current PEP 683 (immortal objects) implementation uses 32-bit saturated refcounts, leaving us an opportunity with most of the remaining bits. Mark's been salivating at that for a while, including for use as per-object flags. 😄

(The PEP has been submitted to the steering council.)
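As a rough illustration of how a saturating 32-bit refcount leaves room for per-object flags, here is a standalone toy model. It is not the PEP 683 implementation: the single 64-bit word, the bit positions, and all `toy_*` names are assumptions made for the sketch:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_REFCNT_MASK UINT64_C(0xFFFFFFFF)  /* low 32 bits: refcount (assumed layout) */
#define TOY_FLAG_SHIFT  32                    /* high 32 bits: per-object flags (assumed) */

typedef struct {
    uint64_t word;
} toy_header;

static uint32_t toy_refcnt(const toy_header *h) {
    return (uint32_t)(h->word & TOY_REFCNT_MASK);
}

static uint32_t toy_flags(const toy_header *h) {
    return (uint32_t)(h->word >> TOY_FLAG_SHIFT);
}

static void toy_set_flag(toy_header *h, unsigned bit) {
    assert(bit < 32);
    h->word |= UINT64_C(1) << (TOY_FLAG_SHIFT + bit);
}

/* Saturating increment: once the count hits UINT32_MAX it stays there,
 * so it can never carry into the flag bits. */
static void toy_incref(toy_header *h) {
    if (toy_refcnt(h) != UINT32_MAX) {
        h->word += 1;
    }
}

/* Returns 1 when the count drops to zero. A saturated count is
 * treated as immortal in this sketch and is never decremented. */
static int toy_decref(toy_header *h) {
    uint32_t rc = toy_refcnt(h);
    if (rc == UINT32_MAX) return 0;
    assert(rc > 0);
    h->word -= 1;
    return rc == 1;
}
```

Because the count saturates instead of overflowing, increments and decrements never disturb the upper half of the word, which is what makes those bits usable as flags.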