faster-cpython / ideas


Faster access to per-interpreter globals. #692

Open markshannon opened 1 month ago

markshannon commented 1 month ago

We recently saw a big performance regression on the telco benchmark when the decimal module was moved to multi-phase init. Accessing state is now much slower than before. Anecdotally, accessing a global now takes 7 dependent loads instead of 1. (@mdboom do you have a link for this?)

If we observe that what we need to replace (C) global variables is per-interpreter variables, not per-module ones, we can design an API that needs far fewer indirections.

This API is largely stolen from HPy with a few tweaks for better performance.

typedef struct { uintptr_t index; } PyGlobal;
/* Declare a global */

/* Initialize global, this must be called at least once per-process.
 * This function is idempotent, so can be called whenever a module is loaded */
void PyGlobal_Init(PyGlobal *name);

PyObject *PyGlobal_Load(PyGlobal name);
void PyGlobal_Store(PyGlobal name, PyObject *value);


Each interpreter state has a reference to an array of PyObject * pointers. PyGlobal_Init() initializes the global to some non-zero index and makes sure that each interpreter has a table large enough to store that index. Load and store can then be implemented as follows:

PyObject *
PyGlobal_Load(PyGlobal name)
{
    return Py_NewRef(_PyThreadState_GET()->globals_table[name.index]);
}

void
PyGlobal_Store(PyGlobal name, PyObject *value)
{
    PyObject **table = _PyThreadState_GET()->globals_table;
    PyObject *tmp = table[name.index];
    table[name.index] = Py_NewRef(value);
    /* Drop the reference to the previous value, if any. */
    Py_XDECREF(tmp);
}
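
To make the indexing scheme concrete, here is a minimal standalone sketch that models it without CPython. Everything in it is an assumption made for illustration: `Interp`, `global_init`, `interp_ensure_table`, `global_load` and `global_store` are hypothetical names, plain `void *` slots stand in for `PyObject *` references (so there is no refcounting), and growing the table is a separate step rather than part of init.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model only: "Interp" stands in for an interpreter state and plain
 * void * slots stand in for PyObject * references (no refcounting). */
typedef struct { uintptr_t index; } PyGlobal;

typedef struct {
    void **globals_table;   /* one slot per allocated index */
    size_t table_size;      /* number of slots currently allocated */
} Interp;

/* Process-wide counter; index 0 is reserved to mean "not initialized". */
static uintptr_t next_index = 1;

/* Idempotent: an index is allocated only on the first call for a global. */
static void
global_init(PyGlobal *name)
{
    if (name->index == 0) {
        name->index = next_index++;
    }
}

/* Grow one interpreter's table so that every allocated index is valid.
 * Returns 0 on success and -1 if memory could not be allocated. */
static int
interp_ensure_table(Interp *interp)
{
    if (interp->table_size >= next_index) {
        return 0;
    }
    void **grown = realloc(interp->globals_table,
                           next_index * sizeof(void *));
    if (grown == NULL) {
        return -1;
    }
    for (size_t i = interp->table_size; i < next_index; i++) {
        grown[i] = NULL;    /* new slots start out empty */
    }
    interp->globals_table = grown;
    interp->table_size = next_index;
    return 0;
}

/* A load is a single indexed read once the interpreter is known. */
static void *
global_load(Interp *interp, PyGlobal name)
{
    return interp->globals_table[name.index];
}

static void
global_store(Interp *interp, PyGlobal name, void *value)
{
    interp->globals_table[name.index] = value;
}
```

In this model the same `PyGlobal` value indexes a different slot in each interpreter's table, which is what lets one C-level declaration behave as a per-interpreter variable.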

markshannon commented 1 month ago

The above API handles individual objects. We would need something different, but similar, if we want to handle arbitrary blocks of C data (e.g. module state).

For arbitrary data we can't use Py_DECREF, so we need a cleanup callback, which gives this API:

typedef struct { uintptr_t index; funcptr cleanup; } PyGlobal;
/* Declare a global */

/* Initialize global, this must be called at least once per-process.
 * This function is idempotent, so can be called whenever a module is loaded */
void PyGlobal_Init(PyGlobal *name, funcptr cleanup);

void *PyGlobal_LoadData(PyGlobal name);
void PyGlobal_StoreData(PyGlobal name, void *data);

void *
PyGlobal_LoadData(PyGlobal name)
{
    return _PyThreadState_GET()->globals_table[name.index];
}

void
PyGlobal_StoreData(PyGlobal name, void *data)
{
    void **table = _PyThreadState_GET()->globals_table;
    void *tmp = table[name.index];
    table[name.index] = data;
    /* Release the previous block, if any, through the cleanup callback. */
    if (tmp != NULL) {
        name.cleanup(tmp);
    }
}
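
The role of the cleanup callback can be sketched in a standalone model. The names here are hypothetical and not part of the proposal: `cleanup_fn` is an assumed shape for `funcptr`, `ToyInterp` is a stand-in for an interpreter state with a fixed-size table, and `counting_free` is a cleanup that counts its calls so the behaviour can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the "funcptr" type used in the proposal. */
typedef void (*cleanup_fn)(void *);

typedef struct { uintptr_t index; cleanup_fn cleanup; } DataGlobal;

/* A fixed-size table keeps the sketch short; a real table would grow. */
enum { TABLE_SIZE = 8 };
typedef struct { void *globals_table[TABLE_SIZE]; } ToyInterp;

static void *
data_load(ToyInterp *interp, DataGlobal name)
{
    return interp->globals_table[name.index];
}

/* Replace the stored block and hand the old one to the cleanup callback. */
static void
data_store(ToyInterp *interp, DataGlobal name, void *data)
{
    void *old = interp->globals_table[name.index];
    interp->globals_table[name.index] = data;
    if (old != NULL && name.cleanup != NULL) {
        name.cleanup(old);
    }
}

/* Example cleanup: frees the block and counts how often it ran. */
static int cleanup_calls = 0;

static void
counting_free(void *block)
{
    cleanup_calls++;
    free(block);
}
```

Storing NULL through `data_store` releases whatever block is still held, which is one way an interpreter could clear its table at shutdown in this model.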

erlend-aasland commented 1 month ago

Related discussion:

neonene commented 1 month ago

I'm not a fan of the TLS version of _PyThreadState_GET(), since it has slowed things down on non-Linux OSes, including Windows in my case. (IIRC, get_state() in obmalloc.c has been one of the bottlenecks.)

neonene commented 1 week ago

Also, please consider enhancing METH_METHOD C function calls: