capi-workgroup / api-revolution

Proposals for radical or controversial API changes

Forward-compatibility for special case APIs #4

Open zooba opened 11 months ago

zooba commented 11 months ago

One of the issues we've seen in the existing API is that we grow more specific functions over time, for performance and other reasons.

For example, PyObject_GetItem is theoretically sufficient for all get-item scenarios, and yet we offer a variety of public APIs that streamline certain operations.[^1][^2]

[^1]: Particularly for dict objects, but those are largely due to back-compat with design flaws we made in the past.

[^2]: I recognise many of these are implementations of type-specific handlers for GetItem, but even if we wanted to make them internal now, we can't and wouldn't.

So I want to propose an API that lets us extend these into the future, allowing even more special/optimised cases without causing undue burden on maintainers. Names are obviously all going to be bikeshedded, and I suspect this will fit nicely into one of the other proposals going around, but this is the aspect I personally care most about so happy to see it merged into another one. But until then, I'm going to call it PyObject_GetSlots.

The basic idea is that each object may return a struct containing function pointers that are specific to its current instance, based on the argument given by the caller. Thinking about how you might perform __getitem__, here's a more fully spelled out example (again, using current naming/objects, but could be adapted to fit a new API design):

// These definitions would be #included from our own headers
// I don't particularly care what type the slot names are, provided they're statically linked into the caller
typedef (...) PySlotName;
#define PY_SLOT_GETITEM (...)
#define PY_SLOT_GETITEM_STR (...)
#define PY_SLOT_GETITEM_INT (...)

typedef struct PySlots PySlots;

struct PySlots_Base {
    int (*release)(PySlots *self);
};

struct PySlots {
    void *handle;
    struct PySlots_Base *slots;
};

PyAPI_FUNC(int) PyObject_GetSlots(PyObject *o, PySlotName name, PySlots *slots);

static inline int PySlots_Release(PySlots *slots) {
    return (*slots->slots->release)(slots);
}

struct PySlot_GetItem {
    struct PySlots_Base base;
    PyObject * (*getitem)(PySlots *self, PyObject *key);
};

struct PySlot_GetItemString {
    struct PySlots_Base base;
    PyObject * (*getitem_utf8)(PySlots *self, const char *key);
    PyObject * (*getitem_utf16)(PySlots *self, const wchar_t *key);
};

struct PySlot_GetItemInt {
    struct PySlots_Base base;
    PyObject * (*getitem_int)(PySlots *self, int key);
    PyObject * (*getitem_ssize_t)(PySlots *self, Py_ssize_t key);
};

// This would be in user's code (or maybe even a header or static library)

void f(PyObject *dict, Py_ssize_t k) {
    PyObject *value = NULL;
    PySlots dict_getitem;

    if (PyObject_GetSlots(dict, PY_SLOT_GETITEM_INT, &dict_getitem)) {
        // Obtained the fast path
        value = (*((struct PySlot_GetItemInt *)dict_getitem.slots)->getitem_ssize_t)(&dict_getitem, k);
        PySlots_Release(&dict_getitem);
    } else if (PyObject_GetSlots(dict, PY_SLOT_GETITEM, &dict_getitem)) {
        // Fast path not available on this version, use the slow path
        PyObject *key = PyLong_FromSsize_t(k);
        // If boxing the key failed, leave value as NULL with the error set
        value = key ? (*((struct PySlot_GetItem *)dict_getitem.slots)->getitem)(&dict_getitem, key) : NULL;
        PySlots_Release(&dict_getitem);
        Py_XDECREF(key);
    } else {
        // should be unreachable unless we deprecate/remove PY_SLOT_GETITEM
    }

    // handle value==NULL or else go on and use it
}

Key points:

The biggest downside is that it can get to be really messy C code when used directly (I'm not 100% sure I typed it right in my example), but headers/macros and/or a static library could really help. Those helpers have to be statically compiled into the caller rather than being part of the Python runtime itself, or they will break on earlier versions.

zooba commented 11 months ago

Additional point: you can #ifdef test for the presence of the slot name to determine whether the struct is available. So it's possible to make the code above compile with older versions of the headers like this:

#ifdef PY_SLOT_GETITEM_INT
    if (PyObject_GetSlots(dict, PY_SLOT_GETITEM_INT, &dict_getitem)) {
        // Obtained the fast path
        value = (*((struct PySlot_GetItemInt *)dict_getitem.slots)->getitem_ssize_t)(&dict_getitem, k);
        PySlots_Release(&dict_getitem);
    } else
#endif
#ifdef PY_SLOT_GETITEM  // wouldn't do this because it would always exist... but for example
    if (PyObject_GetSlots(dict, PY_SLOT_GETITEM, &dict_getitem)) {
        // Fast path not available on this version, use the slow path
        PyObject *key = PyLong_FromSsize_t(k);
        // If boxing the key failed, leave value as NULL with the error set
        value = key ? (*((struct PySlot_GetItem *)dict_getitem.slots)->getitem)(&dict_getitem, key) : NULL;
        PySlots_Release(&dict_getitem);
        Py_XDECREF(key);
    } else
#endif
    {
        // should be unreachable unless we deprecate/remove PY_SLOT_GETITEM
    }

encukou commented 11 months ago

When you know PY_SLOT_GETITEM_INT exists in the world, you'll also know its value. So it might be better to define it yourself:

#ifndef PY_SLOT_GETITEM_INT
#define PY_SLOT_GETITEM_INT 42
#endif

or find a compat header library that does that...


Would custom types be expected to provide these? What would the API look like? Do we expect that users of PY_SLOT_GETITEM_INT should always fall back to PY_SLOT_GETITEM?

zooba commented 11 months ago

When you know PY_SLOT_GETITEM_INT exists in the world, you'll also know its value.

You'll also need the struct layout and all the prototypes, so you'd definitely get it from our headers. The only reason you'd test for the value is to see if you have all of those defined - it's for source compatibility, not for binary compatibility.

Would custom types be expected to provide these? What would the API look like?

Could do. I guess it would be a PyTypeObject member with the same signature as PyObject_GetSlots. There's no reason for it to look different from internal implementations.

Do we expect that users of PY_SLOT_GETITEM_INT should always fall back to PY_SLOT_GETITEM?

If they want to actually get an item, yeah. If they'd rather refuse because it'll be "too slow," that's up to them. But I expect a helpful inline function that does the fallback would be popular. Then you compile with the latest available Python headers and get whichever is the "best"/fastest/safest behaviour available on whatever version you're running on.
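A minimal sketch of what such an inline fallback helper could look like, written against a mocked-up runtime so that it compiles on its own. Every name here (`MockObject`, `mock_get_slots`, `SLOT_GETITEM_INT`, `get_item_ssize`) is hypothetical and only stands in for the proposed `PyObject`, `PyObject_GetSlots` and `PY_SLOT_*` names; the `supports_int_slot` flag simulates an older runtime that does not know the integer slot.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the proposed API; nothing here is real CPython. */
typedef int SlotName;
#define SLOT_GETITEM      1   /* generic: key is a boxed object */
#define SLOT_GETITEM_INT  2   /* fast path: key is a plain integer */

typedef struct MockObject {
    int supports_int_slot;    /* 0 simulates an older runtime */
    long items[4];
} MockObject;

typedef struct Slots Slots;
struct SlotsBase { int (*release)(Slots *self); };
struct Slots { void *handle; const struct SlotsBase *slots; };

struct SlotGetItem {
    struct SlotsBase base;
    long (*getitem)(Slots *self, const long *boxed_key);
};
struct SlotGetItemInt {
    struct SlotsBase base;
    long (*getitem_ssize_t)(Slots *self, ptrdiff_t key);
};

static int release_noop(Slots *self) { self->handle = NULL; return 0; }

static long generic_getitem(Slots *self, const long *boxed_key) {
    return ((MockObject *)self->handle)->items[*boxed_key];
}
static long fast_getitem(Slots *self, ptrdiff_t key) {
    return ((MockObject *)self->handle)->items[key];
}

static const struct SlotGetItem    generic_table = { { release_noop }, generic_getitem };
static const struct SlotGetItemInt fast_table    = { { release_noop }, fast_getitem };

/* Returns nonzero when the requested slot is available, as in the caller examples. */
static int mock_get_slots(MockObject *o, SlotName name, Slots *out) {
    if (name == SLOT_GETITEM_INT && o->supports_int_slot) {
        out->handle = o; out->slots = &fast_table.base; return 1;
    }
    if (name == SLOT_GETITEM) {
        out->handle = o; out->slots = &generic_table.base; return 1;
    }
    return 0;
}

/* The helper: compiled into the extension, it picks the best slot at run time.
 * *used_fast reports which path was taken so the behaviour can be observed. */
static inline int get_item_ssize(MockObject *o, ptrdiff_t key, long *value, int *used_fast) {
    Slots s;
    if (mock_get_slots(o, SLOT_GETITEM_INT, &s)) {
        *value = ((const struct SlotGetItemInt *)s.slots)->getitem_ssize_t(&s, key);
        *used_fast = 1;
    } else if (mock_get_slots(o, SLOT_GETITEM, &s)) {
        long boxed = (long)key;   /* stands in for boxing the key as an object */
        *value = ((const struct SlotGetItem *)s.slots)->getitem(&s, &boxed);
        *used_fast = 0;
    } else {
        return -1;                /* neither slot is available */
    }
    return s.slots->release(&s);
}
```

Because the helper lives in the caller's binary, the same compiled extension takes the fast path where the runtime offers it and the generic path everywhere else.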

encukou commented 11 months ago

What are the advantages over only providing that "helper" -- a "caller" API like:

PyObject *Py_GetItem_ssize(PyObject *obj, Py_ssize_t key);

where Python itself would try the proper fallbacks?

Refusing a fallback because it's too slow doesn't seem very compelling...

zooba commented 11 months ago

Remember the context is compatibility over multiple releases, so I'm assuming that the specialised function doesn't exist in version N, but is added in N+1 (or later), and we want to support extension modules that can be loaded in any version from N onwards.

Code written for version N is going to use a regular GetItem that takes a PyObject key, because that's the only one available in version N. Code written for N+1 could use a newly exported GetItem_ssize that has its own fallback (if obj doesn't support integer keys), but if GetItem_ssize is an export then that module will no longer work on version N, because it wasn't exported in that version. The code would have to dynamically import GetItem_ssize and fall back to GetItem in case it's not present.

If GetItem_ssize is an inline function in the headers, then absolutely we could provide it and do the fallback ourselves. I even said as much in my first post 😉 The trick is that it has to be compiled into the extension and not in libpython. So a static import library would be fine, or macros/inline functions in headers.

The rest is really just a more efficient way of doing polymorphism upfront. For example, if you PyObject_GetSlots on a dict object, we can return a static table of pointers to functions that assume they're getting a dict object and don't even type check. Make the same call on a list and each function can then assume it's a list. If we tried to do the same thing with DLL exports, we'd just have thousands of public APIs and it would be quite unclear how to use them.

We could even get crazy in certain circumstances. For example, we could have a slots type for "list of up to 8 integers" that just contains int n; Py_ssize_t values[8]; - there's no reason they need to be functions. The model I'm proposing here is flexible to allow this, and saves us defining new functions every time we have a new idea.[^1]
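To make the "slots don't have to be functions" idea concrete, here is a rough sketch of a data-only slots struct for a list of up to 8 integers. All names (`SlotSmallIntList`, `get_small_int_list`, `sum_small_list`) are hypothetical, `ptrdiff_t` stands in for `Py_ssize_t`, and the provider is mocked with a plain C array rather than a real list object.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mock of the proposed layout; none of these names exist in CPython. */
typedef struct Slots Slots;
struct SlotsBase { int (*release)(Slots *self); };
struct Slots { void *handle; const struct SlotsBase *slots; };

/* A "slot" that carries values instead of function pointers. */
struct SlotSmallIntList {
    struct SlotsBase base;
    int n;                 /* number of valid entries, at most 8 */
    ptrdiff_t values[8];   /* stands in for Py_ssize_t */
};

/* The struct was allocated per request, so release frees it. */
static int small_list_release(Slots *self) {
    free((void *)self->slots);
    self->slots = NULL;
    return 0;
}

/* Mock provider: succeeds (returns 1) only for lists of up to 8 integers. */
static int get_small_int_list(const ptrdiff_t *items, int count, Slots *out) {
    if (count < 0 || count > 8) return 0;
    struct SlotSmallIntList *s = malloc(sizeof *s);
    if (s == NULL) return 0;
    s->base.release = small_list_release;
    s->n = count;
    for (int i = 0; i < count; i++) s->values[i] = items[i];
    out->handle = NULL;
    out->slots = &s->base;
    return 1;
}

/* Caller: sums the values without any per-item call back into the runtime. */
static ptrdiff_t sum_small_list(const ptrdiff_t *items, int count, int *ok) {
    Slots s;
    ptrdiff_t total = 0;
    *ok = get_small_int_list(items, count, &s);
    if (*ok) {
        const struct SlotSmallIntList *data = (const struct SlotSmallIntList *)s.slots;
        for (int i = 0; i < data->n; i++) total += data->values[i];
        s.slots->release(&s);
    }
    return total;
}
```

This version allocates the struct on every request; giving the slots struct some inline storage of its own would be one way to avoid that allocation.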

(Or for a more normal example, every protocol could have a slots struct - __fspath__ or __index__ would be good candidates IMHO. If an arbitrary class implements it, PyObject_GetSlots can return a struct that'll call it - if not, it returns false - and no need to modify PyTypeObject in a publicly visible way.)

[^1]: It might be a smart move to make the PySlots struct include more extra space than just a void*, which could mean we can do zero-allocation returns for more types.

encukou commented 11 months ago

Ah, I see. It solves a problem with the limited API that you can't use newer features/optimizations if you want to support an old Python version.

Given that you can compile wheels for the new & old versions separately, the complexity might not be worth it.

zooba commented 11 months ago

Given that you can compile wheels for the new & old versions separately ...

Uhm... are you suggesting that people should just compile wheels for each version? What's the point of having a stable ABI/API between versions then?

encukou commented 11 months ago

Yeah, you're right. It expands the scope of the stable ABI, but it'd be a welcome enhancement.

https://github.com/capi-workgroup/api-evolution/issues/1 is another possible approach to solving it.

zooba commented 11 months ago

The two approaches come at it from different directions.

This one gives us a path to design an ABI level that remains compatible over time, and then we can build helpers on top of it to make life easier for end users (similar in principle to #1 Native Interface proposal, but this would work as a component of that rather than an alternative).

The other approach assumes that we'll continue to add, change and remove ABI members over time in incompatible ways, but will include shims in user's code as needed to handle the differences. This doesn't handle forward-compatibility, because we still can't predict future changes, but it does allow us to handle backwards compatibility (though, IMHO, no better than this proposal). It also doesn't allow us to safely provide API-level optimisations or an interface that can be efficient for other Python implementations.

But, as you can tell from them living in separate repositories, they aren't mutually exclusive. We can invest in a shim library (probably HPy is the right place for it?) while also developing an API structure that can eventually be the new stable ABI. If anyone does think they are totally exclusive, I'd like the opportunity to sell this one better, because I don't believe they are.

zooba commented 11 months ago

Oh, we also don't necessarily have to define every slots struct as belonging to the stable ABI if the API to get them does.

It would probably be appreciated by users if we at least have baseline functionality guaranteed to not go away, but provided that we can always point at what the fallback interface ought to be, we can always remove[^2] more specific slots in a later version. Code will still work, which is the promise, even if it loses a bit of performance.[^1]

[^1]: And yes, I know I suggested earlier that an extension might do something other than call the fallback functionality. But we're all consenting adults here. If they do that, they are intentionally planning to break themselves.

[^2]: Remove, but not modify.

encukou commented 11 months ago

The more I think about this the more questions I have :)

We probably want a linear sequence of fallbacks. Let's say we add a context argument to all the signatures. Is the proper order PY_SLOT_GETITEM_INT_CTX -> PY_SLOT_GETITEM_INT -> PY_SLOT_GETITEM_CTX -> PY_SLOT_GETITEM, or the other reasonable permutation?

What happens if a user uses the wrong order? A subtle behaviour difference? Is that a bug in the exporting object (do we require that the fallbacks have equivalent behaviour)? What if different slots are defined in different classes in the MRO?

I can see fallbacks being optional: If there's a slot to "get an int value from a dict", but it overflows, you might, but might not, want to fall back to "get an object from dict".

zooba commented 11 months ago

The "proper order" will depend on the particular slots, so it'd have to be part of the definition. For example, PY_SLOT_GETITEM_CTX would probably be documented as "if unavailable, PY_SLOT_GETITEM is the best fallback, except for <difference in performance/behaviour>".

If the user uses the wrong order, then one of the earlier checks will succeed and none of the later ones will occur. So if they check for PY_SLOT_GETITEM first, when that succeeds (presumably always), every else condition after it is essentially dead code. (Unless we one day deprecate that particular slot, in which case they can get warnings for a while and then the next else condition will be taken. Again, "getitem(object)" is an unlikely candidate for removal, but the principle applies to all slots.)
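The "first check that succeeds wins" behaviour can be sketched with a mock runtime that is nothing more than a bitmask of available slot names. The names and values below are made up for illustration, and the `good` order used in the usage note is just one of the two permutations raised in the question, picked arbitrarily.

```c
#include <stddef.h>

/* Hypothetical slot names; the values are invented for this sketch. */
enum {
    SLOT_GETITEM         = 1,
    SLOT_GETITEM_CTX     = 2,
    SLOT_GETITEM_INT     = 3,
    SLOT_GETITEM_INT_CTX = 4
};

/* Mock runtime: bit i of `available` says whether slot name i is provided. */
static int slot_available(unsigned available, int name) {
    return (int)((available >> name) & 1u);
}

/* Try each name in the caller's preferred order; return the first hit, or 0. */
static int first_available(unsigned available, const int *order, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (slot_available(available, order[i])) return order[i];
    }
    return 0;
}
```

With `good = { SLOT_GETITEM_INT_CTX, SLOT_GETITEM_INT, SLOT_GETITEM_CTX, SLOT_GETITEM }`, a runtime that provides `SLOT_GETITEM` and `SLOT_GETITEM_INT` yields the integer slot. With the generic name listed first, the same runtime always yields `SLOT_GETITEM`, and the later entries are never reached.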

What if different slots are defined in different classes in the MRO?

The identifier is the name, so PY_SLOT_GETITEM will be fully provided by one implementation (presumably the first in the MRO), and would include all members that are part of it (we might put contains and getitem in the same struct, for example).

I haven't specified any MRO logic at all here, btw, since we don't currently have that in native types. Most likely, your PyTypeObject would specify exactly one handler and then that would manually delegate unrecognised names to another type's handler. User-defined classes would probably have a generic implementation for a slot that does the attribute (or legacy slot) lookup when invoked - these slots are about finding the right native function to call, and calling PyObject_GetItem on an object that doesn't support getitem is technically okay, even though it's going to raise an error.

I can see fallbacks being optional: If there's a slot to "get an int value from a dict", but it overflows, you might, but might not, want to fall back to "get an object from dict".

Yeah, exactly. Or even just getting an int value from a long-like object. Any native type could implement a PY_SLOT_AS_INT64 slot with their own int64_t (*get_value)(PySlots*) member, whereas today we would need the type to create an actual PyLongObject so that our conversion functions work.
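As a rough illustration of the `PY_SLOT_AS_INT64` idea, the sketch below mocks a third-party "long-like" type that hands out its value through a `get_value` member without ever building an integer object. `Fixed32`, `SLOT_AS_INT64` (and its value), `fixed32_get_slots` and `as_int64` are all hypothetical names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock of the proposed layout; none of these names exist in CPython. */
typedef struct Slots Slots;
struct SlotsBase { int (*release)(Slots *self); };
struct Slots { void *handle; const struct SlotsBase *slots; };

#define SLOT_AS_INT64 7   /* made-up value */

struct SlotAsInt64 {
    struct SlotsBase base;
    int64_t (*get_value)(Slots *self);
};

/* A third-party "long-like" type that stores its value natively. */
typedef struct Fixed32 { int32_t raw; } Fixed32;

static int fixed32_release(Slots *self) { self->handle = NULL; return 0; }

static int64_t fixed32_get_value(Slots *self) {
    return (int64_t)((const Fixed32 *)self->handle)->raw;
}

static const struct SlotAsInt64 fixed32_as_int64 = {
    { fixed32_release },
    fixed32_get_value
};

/* Provider: returns nonzero only for the one slot name this type knows. */
static int fixed32_get_slots(Fixed32 *o, int name, Slots *out) {
    if (name != SLOT_AS_INT64) return 0;
    out->handle = o;
    out->slots = &fixed32_as_int64.base;
    return 1;
}

/* Caller: 0 on success, -1 if the slot is missing. No fallback is attempted,
 * which treats "this should have worked" as an immediate error. */
static int as_int64(Fixed32 *o, int64_t *result) {
    Slots s;
    if (!fixed32_get_slots(o, SLOT_AS_INT64, &s)) return -1;
    *result = ((const struct SlotAsInt64 *)s.slots)->get_value(&s);
    return s.slots->release(&s);
}
```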

And "this should've worked, but can't" is often enough of an error condition to make it an error immediately rather than going through and trying more native APIs.

encukou commented 11 months ago

It seems that one way to look at this is as an extension of the current sub-slot structs, but with an extra layer of indirection (which the object controls). The indirection might be too slow for core (and thus Cython); should there be an alternate API that's faster for common operations?

It would be helpful to have a concrete example for how the “provider” side would be implemented -- what a type author should do to make PyObject_GetSlots work for a particular class.

zooba commented 11 months ago

In the simplest case, the implementer would do something like this:

// We've defined the interface somewhere in our public headers
struct PySlot_GetSetDelItem {
    struct PySlots_Base base;
    PyObject * (*getitem)(PySlots *self, PyObject *key);
    PyObject * (*setitem)(PySlots *self, PyObject *key, PyObject *value);
    PyObject * (*delitem)(PySlots *self, PyObject *key);
};

// Forward declarations so the table below can refer to the functions
static int _PyDictSlots_Release(PySlots *slots);
static PyObject *_PyDictSlots_GetItem(PySlots *self, PyObject *key);
static PyObject *_PyDictSlots_SetItem(PySlots *self, PyObject *key, PyObject *value);
static PyObject *_PyDictSlots_DelItem(PySlots *self, PyObject *key);

// The type implementer defines a (potentially static) function table in their code
struct PySlot_GetSetDelItem _dict_getitem_slots = {
    { _PyDictSlots_Release },
    &_PyDictSlots_GetItem,
    &_PyDictSlots_SetItem,
    &_PyDictSlots_DelItem
};

// Assuming we've defined PySlots as:
struct PySlots {
    void *handle;
    PySlotName name;
    struct PySlots_Base *slots;
};

// Their "...Slots_Get" function returns the right slots for the requested name.
// This function is found in the PyTypeObject of 'o'
static int
_PyDictSlots_Get(PyObject *o, PySlotName name, PySlots *slots)
{
    switch (name) {
    case PY_SLOT_GETSETDELITEM:
        slots->handle = Py_NewRef(o);
        slots->name = name;     // I just added this, realised we need it for Release
        slots->slots = (struct PySlots_Base*)&_dict_getitem_slots;
        return 1;               // nonzero means "found", matching the caller examples
    }
    return _PySlots_GenericGet(o, name, slots);
}

static int
_PyDictSlots_Release(PySlots *slots)
{
    switch (slots->name) {
    case PY_SLOT_GETSETDELITEM:
        Py_DECREF((PyObject *)slots->handle);
        return 0;
    }
    return _PySlots_GenericRelease(slots);
}

// For this example we just redirect to the existing function, but this could
// contain the actual implementation. It doesn't necessarily have to just be
// a wrapper around the more obvious function.
// Note that PyDict_GetItem returns a borrowed reference and suppresses errors,
// so a real implementation would need to settle the ownership rules.
static PyObject *
_PyDictSlots_GetItem(PySlots *self, PyObject *key)
{
    return PyDict_GetItem((PyObject *)self->handle, key);
}

I also wrote up a rough example of an efficient "AsLong" slots, but posted it in a Gist rather than cluttering up this thread more. It's a very hypothetical idea, but I think it shows the potential efficiency gains more than GetItem does.

encukou commented 11 months ago

Thanks! I still think the indirection will be problematic for core CPython and Cython. We might need something like baking “current” slots into the type object more statically (and with lower compatibility expectations). But that sounds doable.

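Two questions to hash out:

  • What happens if I override __getitem__ in a subclass?
  • What happens if I replace __getitem__ from Python on a mutable class that exposes slots?

Currently, Python knows about all slots so it can update them appropriately.

zooba commented 11 months ago

  • What happens if I override __getitem__ in a subclass?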

Then you'll also override _PyDictSlots_Get as your tp_slots or whatever, and will return something other than _dict_getitem_slots when queried (though you might choose to be compatible/brave enough to reuse some of its functions).

  • What happens if I replace __getitem__ from Python on a mutable class that exposes slots?

I imagine for a mutable class, the slots would be generic (i.e. they'd do the return getattr(self, "__getitem__")(key)). But while a generic GetSlots would return these generic handlers, a specialising GetSlots could use caching/profiling/checks to return faster ones where possible.

This API doesn't really do a huge amount out of the box for Python user-defined classes. But in any case, those are based on our C type, and so we still know all the slots. Just that our tp_mapping (or equivalent) structs of callables would be a completely private cache.

I still think the indirection will be problematic for core CPython and Cython.

You mean performance-wise? Obviously there'll be a minor perf regression compared to assuming the type and calling the function directly, but by using the indirection it means core native functions can efficiently use 3rd party types, whereas today that only works when they haven't overridden our own implementations.

I'm sure Cython will choose to make assumptions about internal layouts that will eventually come back to bite them (after all, that's exactly what they do today). e.g. they'll probably assume that the getitem slots for one dict will work for any dict, which would probably be true up until the point where it isn't.

What they'd gain is the option to have unconditional code for using faster interfaces that still works (slower) on older versions. So if we add a PY_SLOT_FAST_MUL_INT to multiply a PyLongObject by an int constant, they can start querying for that on all versions (with a fallback to regular __mul__ for versions that don't have it yet, which must already have existed). So all Cythonised extensions could be compiled once to cover all active CPython releases and still be efficient, rather than users having to build one wheel per version.

encukou commented 11 months ago

Let me give concrete hypothetical scenarios. These involve inheritance, and exploit (perhaps too much) the mechanism that decouples the slots from the interpreter. I don't see a good way to make them both work. If you do, or you see something that shouldn't be possible, could you correct me?

Scenario 1

Here, foo_ide should see the repr for HyperDataClass, not from the superclass. In other words, PY_SLOT_REPR_UTF8ANDLEN should not be inherited.

Scenario 2

Here, NPY_SLOT_NDARRAY should be inherited from SuperArray, so that it still works like the supertype does.

zooba commented 11 months ago

Scenario 2 is easy, because your SuperArray.tp_getslot will handle the slots it knows and then call SuperArray.tp_base.tp_getslot for anything else (or along those lines).
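The delegation described for Scenario 2 might look roughly like the sketch below. The scenario's own code is not shown in the thread, so the type layout here is an assumption: `MockType`, its `tp_getslot` and `tp_base` members, `BaseArrayType`, `SuperArrayType` and the slot values are all hypothetical, with the base type standing in for whatever provides `NPY_SLOT_NDARRAY`.

```c
#include <stddef.h>

/* Hypothetical mock types; tp_getslot is not a real PyTypeObject member. */
typedef struct Slots { void *handle; const void *table; } Slots;
typedef struct MockType MockType;
typedef int (*getslot_fn)(const MockType *type, void *obj, int name, Slots *out);

struct MockType {
    const MockType *tp_base;
    getslot_fn tp_getslot;
};

enum { SLOT_NDARRAY = 100, SLOT_SUPER_EXTRA = 101 };   /* made-up values */

/* Placeholder tables so the caller can tell which handler answered. */
static const char ndarray_table[] = "ndarray";
static const char extra_table[] = "super-extra";

/* Base type: provides the array slot and nothing else. */
static int basearray_getslot(const MockType *type, void *obj, int name, Slots *out) {
    (void)type;
    if (name == SLOT_NDARRAY) {
        out->handle = obj; out->table = ndarray_table; return 1;
    }
    return 0;
}

/* Subtype: handles the slots it knows, then defers to its base for the rest. */
static int superarray_getslot(const MockType *type, void *obj, int name, Slots *out) {
    if (name == SLOT_SUPER_EXTRA) {
        out->handle = obj; out->table = extra_table; return 1;
    }
    if (type->tp_base != NULL && type->tp_base->tp_getslot != NULL) {
        return type->tp_base->tp_getslot(type->tp_base, obj, name, out);
    }
    return 0;
}

static const MockType BaseArrayType  = { NULL, basearray_getslot };
static const MockType SuperArrayType = { &BaseArrayType, superarray_getslot };
```

Asking `SuperArrayType` for `SLOT_NDARRAY` falls through to the base handler, so the subtype keeps working like its supertype without knowing that slot exists.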

Scenario 1 is hard in general. If a subclass doesn't override a protocol, the superclass provides it, regardless of whether we're talking about this proposal or existing Python code.

Just thinking out loud here, but maybe we could mark some slots as uninheritable (or alternatively, "must be implemented by the most derived class")? If they're ints, maybe the topmost bit is set. So if PY_SLOT_REPR_UTF8ANDLEN is marked as such, your subclass will refuse to return PY_SLOT_REPR_UTF8ANDLEN even though it doesn't know it even exists, and the caller will call again with PY_SLOT_REPR. When you learn it exists, you either implement it directly or by calling your superclass.
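A minimal sketch of the "topmost bit marks a slot as uninheritable" idea, assuming slot names are 32-bit integers. The flag `SLOT_UNINHERITABLE`, the slot values, and the `base_getslot`/`subclass_getslot` handlers are hypothetical; each handler just reports who would provide the slot, so the effect on a caller's fallback is visible.

```c
#include <stdint.h>

/* Hypothetical encoding: the top bit of a 32-bit slot name means
 * "must be provided by the most derived class". All values are made up. */
#define SLOT_UNINHERITABLE    UINT32_C(0x80000000)
#define SLOT_REPR             UINT32_C(1)
#define SLOT_HASH             UINT32_C(2)
#define SLOT_REPR_UTF8ANDLEN  (UINT32_C(3) | SLOT_UNINHERITABLE)

enum provider { PROVIDER_NONE = 0, PROVIDER_BASE = 1, PROVIDER_SUBCLASS = 2 };

/* Base type: knows all three slots. */
static enum provider base_getslot(uint32_t name) {
    if (name == SLOT_REPR || name == SLOT_HASH || name == SLOT_REPR_UTF8ANDLEN)
        return PROVIDER_BASE;
    return PROVIDER_NONE;
}

/* Subclass written before SLOT_REPR_UTF8ANDLEN existed; it overrides repr. */
static enum provider subclass_getslot(uint32_t name) {
    if (name == SLOT_REPR) return PROVIDER_SUBCLASS;            /* its own override */
    if (name & SLOT_UNINHERITABLE) return PROVIDER_NONE;        /* never inherit these */
    return base_getslot(name);                                  /* inherit the rest */
}

/* Caller: prefer the specialised repr slot, then fall back to the plain one. */
static enum provider lookup_repr(enum provider (*getslot)(uint32_t)) {
    enum provider p = getslot(SLOT_REPR_UTF8ANDLEN);
    if (p == PROVIDER_NONE) p = getslot(SLOT_REPR);
    return p;
}
```

Here the subclass refuses the specialised slot purely because of the flag bit, so the caller falls back to `SLOT_REPR` and gets the subclass's override rather than the base type's specialised repr.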