faster-cpython / ideas


Static reduction in reference count operations #444

Open mdboom opened 2 years ago

mdboom commented 2 years ago

Depends on #443

Tagged pointers

The frame stack consists of the locals array, the evaluation stack, and some linkage data, intermixed. Some of the PyObject * pointers are already kept as borrowed references (globals and builtins) for efficiency, but all the others are strong references. Using more borrowed references would be more efficient, but is easy to get wrong. By adding a tag bit to the pointer, we can use it to reliably unwind the stack and to sanity-check our static analysis.

One other advantage of using a tag is that we can mark non-references as "borrowed", simplifying frame exit and unwinding code. We no longer need to care whether a word is a pointer or not, as long as it is tagged correctly.
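As a concrete illustration, the tag bit scheme might look like the following minimal sketch. This is not CPython API; `Obj`, `tag_borrowed`, etc. are hypothetical names, and it assumes (as is true for PyObject) that heap objects are aligned, so the low pointer bit is always free.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { long refcnt; } Obj;   /* toy stand-in for PyObject */

/* Low bit marks a borrowed (or deferred) reference; it is always
 * zero in a real aligned pointer, so setting it loses no information. */
#define BORROWED_TAG ((uintptr_t)1)

static inline uintptr_t tag_borrowed(Obj *p) {
    return (uintptr_t)p | BORROWED_TAG;
}

static inline int is_borrowed(uintptr_t v) {
    return (v & BORROWED_TAG) != 0;
}

static inline Obj *untag(uintptr_t v) {
    return (Obj *)(v & ~BORROWED_TAG);
}
```

With this, unwinding code only needs `is_borrowed()` to decide whether a stack word must be decrefed; it never needs to know which bytecode produced it.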

Deferring references

If we can prove that there is a strong reference to an object in a (transitive) caller then, provided we guard against deletion in debuggers, we can defer (make borrowed) any reference to that object. If we track which references are deferred statically, we can eliminate all overhead for that reference. If we can only track it dynamically, then we only need to check the low bit of the tagged value and don't need to dereference it, still saving the memory accesses.

gvanrossum commented 2 years ago

So any stack entry might or might not have a tag bit, right? Then reading a PyObject * pointer from the stack might be done in the various POP() macros by replacing e.g. *--stack_pointer with ((PyObject *)((uintptr_t)(*--stack_pointer) & ~1)). For the various PUSH() macros we'd probably have to add a flag that gives the bit to be set (code left as an exercise for the reader :-).
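Rendered as toy macros over a local stack array, the suggestion above might look like this (the macro names mirror ceval.c's POP/PUSH but this is a hypothetical sketch, not the real definitions):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { long refcnt; } Obj;   /* toy stand-in for PyObject */

#define TAG_BIT ((uintptr_t)1)

/* POP masks the tag bit off on the way out; PUSH takes an extra
 * argument giving the bit to set (0 for a strong reference). */
#define POP(sp)           ((Obj *)((*--(sp)) & ~TAG_BIT))
#define PUSH(sp, v, tag)  (*(sp)++ = ((uintptr_t)(v) | (tag)))
```

Every consumer pays for the mask whether or not the entry was tagged, which is what the next comment pushes back on.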

markshannon commented 2 years ago

The compiler knows which references are deferred so, instead of masking, we can use addition in the bytecode. Addition has effectively zero cost, when combined with a load from the pointer.

The reason we need tags at all, despite the compiler knowing whether a reference is deferred, is that exception unwinding won't know which are deferred statically, so will need a runtime check.
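To illustrate the "effectively zero cost" claim: when the compiler statically knows a value carries the tag, the untag can be written as a constant subtraction, which folds into the load's address computation (a fixed displacement) on most ISAs, whereas a mask needs a separate ALU operation before the base register is usable. A hedged sketch with a toy object type:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { long refcnt; } Obj;   /* toy stand-in for PyObject */

#define TAG ((uintptr_t)1)

/* Valid only when the caller statically knows the tag is set: the
 * subtraction becomes part of the load's displacement, so untagging
 * costs nothing extra on top of the memory access itself. */
static inline long refcnt_of_tagged(uintptr_t v) {
    return ((Obj *)(v - TAG))->refcnt;
}
```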

gvanrossum commented 2 years ago

Doesn’t masking also have zero cost?

gvanrossum commented 1 year ago

So let's take a really simple example suggested by Mark, a = b.attr. This translates to roughly

LOAD_FAST (b)
LOAD_ATTR (attr)
STORE_FAST (a)

This expands to roughly the following (excluding dispatch, asserts, error checks, and stack manipulations):

    // LOAD_FAST (b)
        value = GETLOCAL(oparg);
        Py_INCREF(value);  // <--------------------
    // LOAD_ATTR (attr)
        res = PyObject_GetAttr(value, name);
        Py_DECREF(value);  // <--------------------
    // STORE_FAST (a)
        SETLOCAL(oparg, res);

We're trying to optimize out the marked INCREF/DECREF pair.

Could we just make a super-instruction of the LOAD_FAST / LOAD_ATTR pair? That would allow us to at least assess the gains for this particular combination, and it's simple to do (just copy the code and remove the INCREF/DECREF pair).
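With toy stand-ins (these names are hypothetical, and the real LOAD_ATTR does far more), the split and fused versions might look like this; the point is that the INCREF/DECREF pair cancels, and the local stays alive because the frame's locals array still owns a strong reference:

```c
#include <assert.h>

typedef struct { long refcnt; long attr; } Obj;  /* toy object */

static inline void incref(Obj *o) { o->refcnt++; }
static inline void decref(Obj *o) { o->refcnt--; }

/* Stand-in for an attribute lookup that is refcount-neutral on its
 * argument (the real PyObject_GetAttr borrows its first argument). */
static long getattr_stub(Obj *o) { return o->attr; }

/* Separate LOAD_FAST + LOAD_ATTR: incref on push, decref after use. */
static long split_version(Obj *local) {
    Obj *value = local;
    incref(value);                 /* LOAD_FAST */
    long res = getattr_stub(value); /* LOAD_ATTR */
    decref(value);
    return res;
}

/* Fused super-instruction: the pair is simply omitted. */
static long fused_version(Obj *local) {
    return getattr_stub(local);
}
```

Both versions return the same result and leave the refcount unchanged; the fused one just skips two memory updates per execution.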

Alas, the actual LOAD_ATTR instruction is quite complicated. We could simplify it by only specializing this when it's not part of a call (so the low bit of oparg is zero), but even then there's the added complication of all the possible specializations (LOAD_ATTR is a huge family). If we made this combo into a super-instruction but then lost the specializations, we'd see a slowdown.

Could we prototype the "tag bit" idea instead, for just the combo of LOAD_FAST + LOAD_ATTR(_VARIANT)? If we take all variants that would still be pretty hairy: every mention of 'owner' would have to be modified to drop the tag bit, and every DECREF(owner) would have to be made conditional; plus there are some cases where ownership is transferred, in which case we'd actually have to add an INCREF(owner) if the tag bit is set.

Hmm...

gvanrossum commented 1 year ago

An offline discussion revealed that Mark's suggestion is to use a different strategy -- just change the type of the stack to an array of structs that wrap a tagged pointer, and fix all the 10,000 places where the compiler balks. Hopefully most of those places are in code generated from bytecodes.c and we can make the code generator do the substitutions automatically.

E.g. if we have an instruction

instr(ADD, (left, right -- result)) {
    result = PyObject_Add(left, right);
    DECREF_INPUTS();
    ERROR_IF(result == NULL, error);
}

the generated code could be something like

TARGET(ADD) {
    TaggedPtr left = stack[-2];
    TaggedPtr right = stack[-1];
    PyObject *result;
    result = PyObject_Add(UNTAG(left), UNTAG(right));
    DECREF_UNLESS_TAGGED(left);
    DECREF_UNLESS_TAGGED(right);
    if (result == NULL) goto pop_2_error;
    stack_pointer -= 1;
    stack_pointer[-1] = UNTAGGED(result);
}

Where TaggedPtr is a struct type wrapping a possibly-tagged pointer, UNTAG() removes the tag bit, UNTAGGED() converts to a TaggedPtr instance without the tag bit set (there should also be a TAGGED() to set the tag bit), and DECREF_UNLESS_TAGGED() calls Py_DECREF() on the pointer value contained in the struct unless the tag bit is set.
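A compilable sketch of what those pieces might look like, using a toy `Obj` with an inline refcount in place of PyObject (the names follow the generated code above, but this is an illustration, not the eventual CPython definitions):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { long refcnt; } Obj;   /* toy stand-in for PyObject */

/* A distinct struct type (not a bare pointer) so the compiler flags
 * every place that still treats stack slots as plain Obj *. */
typedef struct { uintptr_t bits; } TaggedPtr;

#define TAG ((uintptr_t)1)

static inline Obj *UNTAG(TaggedPtr t)    { return (Obj *)(t.bits & ~TAG); }
static inline TaggedPtr UNTAGGED(Obj *p) { return (TaggedPtr){ (uintptr_t)p }; }
static inline TaggedPtr TAGGED(Obj *p)   { return (TaggedPtr){ (uintptr_t)p | TAG }; }

static inline void DECREF_UNLESS_TAGGED(TaggedPtr t) {
    if (!(t.bits & TAG))
        UNTAG(t)->refcnt--;   /* strong reference: give it up */
    /* tagged (deferred/borrowed): no refcount traffic at all */
}
```

Making TaggedPtr a one-field struct rather than a typedef'd integer is what produces the "10,000 errors": each one marks a site that must be audited for tag handling.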

In places that manipulate the stack directly (e.g. error handling, generators) we will have to manually fix the code using the "10,000 errors" approach.

(Since in ceval.c, Py_DECREF() is already a macro that's different from the official Py_DECREF() API, we could change this macro to automatically handle tag bits.)

I'm going to give myself two weeks to prototype this. If by March 15 I haven't gotten it working well enough to run a benchmark I propose to give up on the idea for 3.12.

(UPDATE: Whoops, this is a plan to work on dynamic tagged refcount handling, so should really go in #443.)

markshannon commented 1 year ago

I don't think we will get much from static reduction, at least not until the tier2 optimizer. Static analysis might help reduce the number of operations, but only by supporting better dynamic reductions.

For example, LOAD_ATTR (and its specializations) is often preceded by LOAD_FAST, which can push a deferred reference, but it is also preceded by LOAD_GLOBAL, which cannot. Unless we duplicate LOAD_ATTR and all its specializations, we are going to have to do this dynamically.

carljm commented 1 year ago

I think you can get a lot from doing it statically in the compiler, but it requires also surfacing incref and decref as separate operations that the compiler can explicitly include or not, instead of having them embedded inside the other opcodes. If you can do that, then it doesn't matter that e.g. LOAD_ATTR doesn't know if it's preceded by LOAD_FAST or LOAD_GLOBAL, because the compiler obviously knows which it is in any specific case.

In the Cinder JIT this is how we handle it: the compiler knows the refcounting behavior of each JIT HIR instruction (does it steal references to any inputs, does it output a new reference, does it have arbitrary memory effects that might invalidate other borrowed references, etc), and this allows a compiler pass to automatically insert the minimal refcount operations needed between instructions. This results in a pretty significant reduction in refcount operations (enough that we sometimes see behavior incompatibilities due to objects having lower refcount than in CPython.) (I can collect more detailed numbers on how much the reduction is, if that's interesting to anyone.)

I'm not sure any of that is useful for the tier 1 interpreter though, since exposing refcount operations as explicit instructions probably would come at too high a cost in code size and interpreter loop overhead.

markshannon commented 1 year ago

We are talking about the bytecode compiler, not a JIT compiler. So, this doesn't help much, if at all.

The compiler knowing what type of reference a particular LOAD_ATTR gets doesn't help, because we aren't going to duplicate all ten or so specializations of LOAD_ATTR.

gvanrossum commented 1 year ago

Whoops, my comment above is a plan to work on dynamic tagged refcount handling, so should really go in #443.

carljm commented 1 year ago

@markshannon

We are talking about the bytecode compiler, not a JIT compiler. So, this doesn't help much, if at all.

Agreed, because in a bytecode interpreter, the cost of extra instructions is greater than the cost of extra refcount operations.

The compiler knowing what type of reference a particular LOAD_ATTR gets doesn't help, because we aren't going to duplicate all ten or so specializations of LOAD_ATTR.

It helps if you have explicit INCREF and DECREF instructions that the compiler can insert into the instruction stream, because then there's no need to duplicate all the specializations of LOAD_ATTR, which was the point of my message. But it's a moot point for the above reason, unless someone can think of a different clever way to encode this information per-instruction that's lower cost than an extra instruction, and lower cost than a refcount operation. (And effectively, tagging pointers at runtime is one such way.)

gvanrossum commented 1 year ago

The compiler knowing what type of reference a particular LOAD_ATTR gets doesn't help, because we aren't going to duplicate all ten or so specializations of LOAD_ATTR.

But we could. Most of the tag checks and manipulations are automatically inserted by the code generator. We could have the code generator produce two versions of selected bytecodes, one that expects tags and one that doesn't. We are currently at around 170 opcodes, so we could easily accommodate a dozen extra LOAD_ATTR variants and two dozen CALL variants, if it would help. (And yes, this would reduce code locality in the interpreter, so it's not a slam dunk, but we needn't dismiss it just because there are 10 specializations of LOAD_ATTR.)