Open markshannon opened 3 months ago
The above API handles individual objects. We would need something different, but similar, if we want to handle arbitrary blocks of C data (e.g. module state).
For arbitrary data we can't use Py_DECREF, so we need to add a callback for cleanup, which gives this API:
typedef struct { uintptr_t index; funcptr cleanup; } PyGlobal;
/* Declare a global */
#define PyGLOBAL_DECLARE(NAME) PyGlobal NAME = PY_GLOBAL_INIT;
/* Initialize global, this must be called at least once per-process.
* This function is idempotent, so can be called whenever a module is loaded */
PyGlobal_Init(PyGlobal *name, funcptr cleanup);
void *PyGlobal_LoadData(PyGlobal name);
void PyGlobal_StoreData(PyGlobal name, void *data);
void *
PyGlobal_LoadData(PyGlobal name)
{
    return _PyThreadState_GET()->globals_table[name.index];
}
void
PyGlobal_StoreData(PyGlobal name, void *data)
{
    void **table = _PyThreadState_GET()->globals_table;
    void *tmp = table[name.index];
    /* Publish the new value before running cleanup on the old one. */
    table[name.index] = data;
    name.cleanup(tmp);
}
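To make the store-then-cleanup behaviour concrete, here is a minimal standalone model of the API above. It is a sketch, not CPython code: the Interp struct, the Model_* names, the next_index counter, the concrete funcptr signature and the PY_GLOBAL_INIT value are all assumptions made for illustration, and the table is passed explicitly instead of being fetched through _PyThreadState_GET(). Unlike the snippet above, the model skips cleanup when the slot was empty.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef void (*funcptr)(void *);            /* assumed cleanup signature */
typedef struct { uintptr_t index; funcptr cleanup; } PyGlobal;
#define PY_GLOBAL_INIT {0, NULL}            /* assumed: index 0 = uninitialized */

/* Hypothetical model of one interpreter's globals table. */
typedef struct { void **globals_table; size_t table_size; } Interp;

static uintptr_t next_index = 1;            /* process-wide; indices start at 1 */

/* Idempotent: only the first call assigns an index. */
static void
Model_Init(PyGlobal *name, funcptr cleanup)
{
    if (name->index == 0) {
        name->index = next_index++;
        name->cleanup = cleanup;
    }
}

/* Grow the table so that `index` is a valid slot; returns -1 on failure. */
static int
Model_EnsureSlot(Interp *interp, uintptr_t index)
{
    if (index < interp->table_size) {
        return 0;
    }
    size_t new_size = (size_t)index + 1;
    void **t = realloc(interp->globals_table, new_size * sizeof(void *));
    if (t == NULL) {
        return -1;
    }
    for (size_t i = interp->table_size; i < new_size; i++) {
        t[i] = NULL;                        /* new slots start empty */
    }
    interp->globals_table = t;
    interp->table_size = new_size;
    return 0;
}

static void *
Model_LoadData(Interp *interp, PyGlobal name)
{
    return interp->globals_table[name.index];
}

static void
Model_StoreData(Interp *interp, PyGlobal name, void *data)
{
    void *tmp = interp->globals_table[name.index];
    interp->globals_table[name.index] = data;
    if (tmp != NULL && name.cleanup != NULL) {
        name.cleanup(tmp);                  /* release the previous value */
    }
}

static int cleanup_calls = 0;
static void count_and_free(void *p) { cleanup_calls++; free(p); }

/* Two interpreters share one PyGlobal but keep separate values. */
static int
demo(void)
{
    PyGlobal state = PY_GLOBAL_INIT;
    Interp a = {NULL, 0}, b = {NULL, 0};

    Model_Init(&state, count_and_free);
    uintptr_t first = state.index;
    Model_Init(&state, count_and_free);     /* second call is a no-op */
    assert(first != 0 && state.index == first);

    if (Model_EnsureSlot(&a, state.index) < 0 ||
        Model_EnsureSlot(&b, state.index) < 0) {
        return -1;
    }
    int *x = malloc(sizeof *x);
    int *y = malloc(sizeof *y);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1;
    }
    *x = 1;
    *y = 2;
    Model_StoreData(&a, state, x);          /* slot was empty: no cleanup */
    assert(Model_LoadData(&b, state) == NULL);   /* b is unaffected */
    Model_StoreData(&a, state, y);          /* x is cleaned up here */
    assert(*(int *)Model_LoadData(&a, state) == 2);
    Model_StoreData(&a, state, NULL);       /* y is cleaned up here */

    free(a.globals_table);
    free(b.globals_table);
    return cleanup_calls;
}
```

Calling demo() once from a main() exercises both interpreters. The point of the design is that a load is a single indexed read from a table the interpreter already holds, and a store swaps the slot before the old value is released.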
I'm not a fan of the TLS version of _PyThreadState_GET(), since it has slowed things down on non-Linux OSes, including my Windows machine. (IIRC, get_state() in obmalloc.c has been one of the bottlenecks.)
Also, please consider enhancing METH_METHOD C function calls: https://github.com/python/cpython/issues/123500.
We recently saw a big performance regression on the telco benchmark when the decimal module was moved to multi-phase init. Accessing state is now much slower than before. Anecdotally, accessing a global now takes 7 dependent loads instead of 1. (@mdboom do you have a link for this?)
If we observe that, to replace (C) global variables, we do not need per-module variables but per-interpreter ones, we can design an API that needs far fewer indirections.
This API is largely stolen from HPy with a few tweaks for better performance. https://docs.hpyproject.org/en/stable/api-reference/hpy-global.html
Implementation
Each interpreter state has a reference to an array of PyObject * pointers. PyGlobal_Init() initializes the global to some non-zero index and makes sure that each interpreter has a table large enough to store that index. Then load and store can be implemented as follows: