faster-cpython / ideas


More efficient allocation and deallocation through better integration of memory allocator #612

Open markshannon opened 11 months ago

markshannon commented 11 months ago

Currently, to allocate a new object we use a mix of per-class freelists for some common classes, a custom small object allocator, and the system malloc. Both the custom small object allocator and the system malloc are hidden behind an extra level of indirection.

We should integrate the allocator into CPython, using a sensible API to allow the allocator to be replaced and developed largely independently of the rest of CPython.

First, a few observations.

What's wrong with using mimalloc/jemalloc/tcmalloc?

There is nothing wrong with these allocators, but they are general-purpose implementations of the C malloc/free API. It is not the allocators that are the problem, but the API.

How a new API can be tailored to our use case:

Objects are allocated and deallocated at a very high rate.

The API should provide inline functions for allocation and freeing memory.

Object sizes are heavily biased towards the 32-64 byte range.

This probably doesn't impact the API, but it may affect the chosen sizes of freelists for these size classes.

We cannot move most objects; this is CPython.

We can't use a moving GC, so we must use a free-list based allocator (all the major allocators already work this way).

Python objects are self-describing, so they know their own size. The allocator does not need to provide this information.

The free function should take a size. This can be checked, if necessary, but going straight to the freelist is faster than traversing internal data structures.

When resizing lists, dicts, sets, etc., our choice of scaling factors is based on old experiments with glibc malloc or on abstract theory.

If the allocator exposed the size classes that it uses internally, then allocations could be aligned to size classes, and a reallocation would just increment the size class.

We have a per-thread data structure readily available (the PyThreadState), so there is no need for the allocator to duplicate it.

The malloc and free functions should take a PyThreadState * parameter. The allocator's per-thread data structure can be embedded in the PyThreadState.
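Taken together, these observations suggest roughly the following shape. The sketch below is a toy, not CPython code: every name (ToyThreadState, toy_alloc, and so on) is invented, and the 16-byte size-class spacing is purely an assumption. It shows how a sized free plus per-thread freelists embedded in the thread state avoids both the size lookup and the thread-local lookup.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sketch of the observations above (all names hypothetical):
 * per-thread freelists, indexed by 16-byte-spaced size classes,
 * with a free function that takes the size. */

#define NUM_CLASSES 32               /* covers 16..512 bytes */

typedef struct FreeBlock { struct FreeBlock *next; } FreeBlock;

typedef struct {
    /* Embedded in the (hypothetical) thread state, as proposed:
     * the allocator needs no thread-local lookup of its own. */
    FreeBlock *freelist[NUM_CLASSES];
} ToyThreadState;

static unsigned int size_class(size_t size) {
    /* 16-byte spacing: class 0 = 16 bytes, class 1 = 32 bytes, ... */
    return (unsigned int)((size + 15) / 16) - 1;
}

static void *toy_alloc(ToyThreadState *ts, size_t size) {
    unsigned int cls = size_class(size);
    FreeBlock *b = ts->freelist[cls];
    if (b != NULL) {                 /* fast path: pop the freelist */
        ts->freelist[cls] = b->next;
        return b;
    }
    return malloc((cls + 1) * 16);   /* slow path: fetch a new block */
}

static void toy_free(ToyThreadState *ts, size_t size, void *mem) {
    /* Sized free: no header lookup needed to find the right freelist. */
    unsigned int cls = size_class(size);
    ((FreeBlock *)mem)->next = ts->freelist[cls];
    ts->freelist[cls] = mem;
}
```

The fast path is a couple of loads and stores with no locking, which is the property the "inline functions for allocation and freeing" requirement above is aiming for.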

The API

The Py prefix could be PyUnstable_, at least for the first iteration.

/* Convert size to size class and vice versa.
 * These functions should be evaluated at compile-time
 * if the input is a constant */
unsigned int Py_SizeClass_FromSize(size_t size);
size_t Py_Size_FromSizeClass(unsigned int size_class);

/* Allocation functions */
inline void *Py_Allocate_FromSize(PyThreadState *tstate, size_t size);
inline void *Py_Allocate_FromSizeClass(PyThreadState *tstate, unsigned int size_class);

/* Free functions */

void Py_Free_FromSize(PyThreadState *tstate, size_t size, void *memory);
void Py_Free_FromSizeClass(PyThreadState *tstate, unsigned int size_class, void *memory);

/* The allocator may not know the size of an object, but it should know its size class */

unsigned int _Py_SizeClass_FromPointer(void *memory);

/* And finally convenience functions for easy porting */

void *Py_Allocate(size_t size);
void Py_Free(void *memory);
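To illustrate the comment about compile-time evaluation: if the conversion functions are defined as static inline, a constant argument (such as sizeof(SomeObject)) lets the compiler fold the size-class computation away entirely. The 16-byte spacing below is an assumption for illustration only; the real mapping would belong to the allocator.

```c
#include <assert.h>
#include <stddef.h>

/* One plausible shape for the conversion functions in the proposed API,
 * assuming simple 16-byte-spaced size classes. The point is not the
 * particular mapping, but that these definitions constant-fold when
 * the input is a compile-time constant. */

static inline unsigned int Py_SizeClass_FromSize(size_t size) {
    return (unsigned int)((size + 15) / 16) - 1;
}

static inline size_t Py_Size_FromSizeClass(unsigned int size_class) {
    return ((size_t)size_class + 1) * 16;
}
```

Round-tripping a size through the two functions rounds it up to its size class, which is exactly the property the resizing observation above wants to exploit.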
markshannon commented 11 months ago

We should also rip out PEP 445 allocators.

We can provide a pluggable API for obtaining the big chunks of memory that back the allocator, since the cost of an extra indirection on top of a large allocation is negligible.
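A minimal sketch of what that pluggable chunk API might look like, with all names invented: only the rare, large "chunk" requests go through a replaceable function pointer, so the indirection cost is amortized over the many object allocations served from each chunk.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch (names invented): unlike PEP 445's per-allocation
 * hooks, only bulk chunk requests are routed through a replaceable
 * provider; per-object fast paths are untouched. */

typedef struct {
    void *(*alloc_chunk)(size_t size);
    void (*free_chunk)(void *chunk, size_t size);
} PyChunkProvider;

static void *default_alloc_chunk(size_t size) { return malloc(size); }
static void default_free_chunk(void *chunk, size_t size) {
    (void)size;
    free(chunk);
}

static PyChunkProvider chunk_provider = {
    default_alloc_chunk, default_free_chunk
};

/* An embedder could swap in its own provider (e.g. for tracking or
 * huge pages) without touching the per-object fast paths. */
static void set_chunk_provider(PyChunkProvider p) { chunk_provider = p; }

static void *get_chunk(size_t size) {
    return chunk_provider.alloc_chunk(size);
}
```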

kmod commented 11 months ago

I'm sure you've seen this too, but another place an integrated memory allocator could help is the GC. My understanding is that a big part of why the CPython GC is slower than other VMs' is that it gets no help from the memory allocator for common tasks like enumerating live objects, so it has to maintain its own doubly linked list. That is expensive both in memory and in time, because the traversals are non-sequential. Having the allocator and GC completely decoupled is definitely nice, but I think there are performance wins to be had if there's a willingness to integrate them.
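To make the contrast concrete, here is a toy illustration (not CPython's actual design, and not how any particular allocator stores its metadata): a liveness bitmap over a pool, which the allocator effectively maintains anyway, lets a collector sweep live objects sequentially, with no per-object link pointers at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy illustration of allocator-assisted enumeration: the "live" bitmap
 * stands in for metadata the allocator keeps internally, replacing the
 * GC's doubly linked list of objects. */

#define POOL_SLOTS 64

typedef struct {
    int objects[POOL_SLOTS];        /* stand-in for object storage */
    bool live[POOL_SLOTS];          /* the allocator tracks this anyway */
} ToyPool;

static int pool_alloc(ToyPool *p) {
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!p->live[i]) { p->live[i] = true; return i; }
    }
    return -1;                      /* pool exhausted */
}

static void pool_free(ToyPool *p, int slot) { p->live[slot] = false; }

/* GC pass: a sequential, cache-friendly sweep over the pool, instead of
 * chasing prev/next pointers scattered across the heap. */
static size_t gc_count_live(const ToyPool *p) {
    size_t n = 0;
    for (int i = 0; i < POOL_SLOTS; i++) n += p->live[i];
    return n;
}
```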

gvanrossum commented 10 months ago

@colesbury Your thoughts?

colesbury commented 10 months ago

A few random thoughts below. I think mimalloc provides most of the APIs you've mentioned.

The API should provide inline functions for allocation and freeing memory.

I tried this with mimalloc and did not measure a performance improvement. When I talked with Daan Leijen about this, I think he said that other projects also did not benefit from inlining the malloc functions. I think this may be for a few reasons, including that the inlined fast-path malloc code still has to be able to call the out-of-line slow path, so you don't really reduce register spilling by inlining it.

I think there's a bigger ROI in optimizing our (CPython) side of the allocation. We often have to go through a bunch of indirections and then read size information from the PyTypeObject, even when it is knowable to the compiler as a constant.
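A small sketch of that point, with invented names: a generic allocation path must load the object size from the type object at runtime, whereas a per-type specialized path hands the compiler a constant, so any size-class computation layered on top can constant-fold as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Invented names throughout; a sketch of generic vs. specialized
 * allocation paths, not CPython's actual code. */

typedef struct { size_t basicsize; } ToyType;
typedef struct { ToyType *type; double value; } FloatLike;

static ToyType FloatLike_Type = { sizeof(FloatLike) };

/* Generic path: the size is loaded from memory on every call. */
static void *alloc_generic(ToyType *tp) {
    return malloc(tp->basicsize);
}

/* Specialized path: sizeof(FloatLike) is a compile-time constant,
 * so the compiler can fold away any size(-class) computation. */
static FloatLike *alloc_floatlike(void) {
    FloatLike *op = malloc(sizeof(FloatLike));
    if (op != NULL) op->type = &FloatLike_Type;
    return op;
}
```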

The free function should take a size...

The allocators mentioned above provide sized-free functions. My understanding is that for jemalloc this is important for good performance, but it's not useful for mimalloc due to different design tradeoffs in its internal data structures. mimalloc provides a mi_free_size(void* p, size_t size), but it just ignores the size argument.

The malloc and free functions should take a PyThreadState * parameter. The allocator's per-thread data structure can be embedded in the PyThreadState

I did this in nogil-3.9 by putting mi_heap_t, which is mimalloc's per-thread data structure, in PyThreadState.

If the allocator exposed the size classes that it uses internally, then allocations can be aligned to size classes, and reallocations just increment the size class

Agreed - this is what I've done when allocating PyListObject backing arrays in nogil-3.9.