markshannon opened 11 months ago
We should also rip out PEP 445 allocators.
We can provide a pluggable API for getting the big chunks of memory for the allocator, as the cost of the extra indirection on top of a large allocation would be negligible.
I'm sure you've seen this too, but another place an integrated memory allocator could help is the GC. My understanding is that a big part of why CPython's GC is slower than other VMs' is that it gets no help from the memory allocator for common operations like enumerating live objects, so it has to maintain its own doubly linked list, which is expensive both in memory and in time, since the traversals are non-sequential. Having the allocator and GC completely decoupled is definitely nice, but I think there are performance wins to be had if there's a willingness to integrate them.
@colesbury Your thoughts?
A few random thoughts below. I think mimalloc provides most of the APIs you've mentioned.
> The API should provide inline functions for allocation and freeing memory.
I tried this with mimalloc and did not measure a performance improvement. When I talked with Daan Leijen about this, I think he said that other projects also did not benefit from inlining the malloc functions. This may be for a few reasons, including that the inlined fast-path malloc code still has to be able to call the out-of-line slow path, so inlining it doesn't really reduce register spilling.
I think there's a bigger ROI on optimizing our (CPython) side of the allocation. We often have to go through a bunch of indirections and then read size information from PyTypeObject, even when it's knowable to the compiler as a constant.
> The free function should take a size...
The mentioned allocators provide sized free functions. My understanding is that for jemalloc this is important for good performance, but it's not useful for mimalloc due to different design tradeoffs around internal data structures. mimalloc provides `mi_free_size(void* p, size_t size)`, but it just ignores the size argument.
> The `malloc` and `free` functions should take a `PyThreadState *` parameter. The allocator's per-thread data structure can be embedded in the `PyThreadState`
I did this in nogil-3.9 by putting `mi_heap_t`, which is mimalloc's per-thread data structure, in `PyThreadState`.
> If the allocator exposed the size classes that it uses internally, then allocations can be aligned to size classes, and reallocations just increment the size class
Agreed - this is what I've done when allocating PyListObject backing arrays in nogil-3.9.
Currently, to allocate a new object we use a mix of per-class freelists for some common classes, a custom small object allocator, plus the system malloc. Both the custom small object allocator and system malloc are hidden behind an extra level of indirection.
We should integrate the allocator into CPython, using a sensible API to allow the allocator to be replaced and developed largely independently of the rest of CPython.
First, a few observations:

- We can't use a moving GC, so we must use a free-list based allocator (all the major allocators already work this way)
- Most allocations fall into a small number of size classes. This probably doesn't impact the API, but may affect the chosen sizes of free-lists for these size classes
- Per-thread data can live on the thread state (`PyThreadState`), so there is no need for the allocator to duplicate this.

What's wrong with using mimalloc/jemalloc/tcmalloc?

There is nothing wrong with the allocators. But they are general purpose implementations of the C malloc/free API. It is not the allocators that are a problem, but the API.

How a new API can be tailored to our use case:

- The API should provide inline functions for allocation and freeing memory.
- The `free` function should take a size. This can be checked, if necessary, but going straight to the freelist is faster than traversing internal data structures.
- If the allocator exposed the size classes that it uses internally, then allocations could be aligned to size classes, and reallocations would just increment the size class.
- The `malloc` and `free` functions should take a `PyThreadState *` parameter. The allocator's per-thread data structure can be embedded in the `PyThreadState`.
The API

The `Py` prefix could be `PyUnstable_`, at least for the first iteration.