faster-cpython / ideas


Deferred reference counts #443

Closed by mdboom 7 months ago

mdboom commented 2 years ago

(From discussion 390).

From stats gathered from running the test suite, the average object sees 6 increfs and 7 decrefs over its lifetime. There are about 1.1 increfs and 1.3 decrefs per bytecode instruction executed.

These seem to be split about 55%/45% between the interpreter and all other code ("interpreter" here includes frame and generator operations).

[TO DO: Get stats from standard benchmark suite]

Each incref/decref pair removed saves four memory reads or writes, which is the same as instruction dispatch for 3.11 (we expect to reduce instruction dispatch to three memory accesses for 3.12).
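For reference, the cost of such a pair in a non-debug build looks roughly like this (a simplified sketch, ignoring debug-build bookkeeping; the helper names are just for illustration), which is where the four memory accesses come from:

static inline void simplified_incref(PyObject *op)
{
    op->ob_refcnt++;                  /* one read and one write of ob_refcnt */
}

static inline void simplified_decref(PyObject *op)
{
    if (--op->ob_refcnt == 0) {       /* one read and one write of ob_refcnt */
        _Py_Dealloc(op);              /* only taken when the last reference dies */
    }
}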

Some of these reference count modifications are necessary, as they represent references from the heap. However, it should be possible to eliminate many of the counting operations for references from the stack.

There are a number of places to look that should show improvements.

Deferred reference counts

Suppose our static analysis shows that the local a outlives the expression x = a + b. We could then emit bytecode that does not incref a when pushing it to the stack, and does not decref it when a + b has been computed. The reference to a on the stack would be deferred.

gvanrossum commented 2 years ago

Okay, so to summarize deferring references with a concrete example, suppose we see

# a is a local, b is a global
a = a + b

Let's say today this produces this bytecode:

LOAD_FAST a
LOAD_GLOBAL b
BINARY_OP +
STORE_FAST a

The idea is that we add new opcodes so this can be written as

LOAD_FAST_BORROW a  # Doesn't INCREF a
LOAD_GLOBAL b
BINARY_OP_XXX +  # Doesn't DECREF left arg
STORE_FAST a

The code in ceval.c for LOAD_FAST is

        TARGET(LOAD_FAST) {
            PyObject *value = GETLOCAL(oparg);
            assert(value != NULL);
            Py_INCREF(value);  // <-------- mark
            PUSH(value);
            DISPATCH();
        }

So for LOAD_FAST_BORROW we would drop the marked INCREF line.

Similarly, for BINARY_OP the code is:

        TARGET(BINARY_OP) {
            PREDICTED(BINARY_OP);
            PyObject *rhs = POP();
            PyObject *lhs = TOP();
            assert(0 <= oparg);
            assert((unsigned)oparg < Py_ARRAY_LENGTH(binary_ops));
            assert(binary_ops[oparg]);
            PyObject *res = binary_ops[oparg](lhs, rhs);
            Py_DECREF(lhs);  // <----------------- mark
            Py_DECREF(rhs);
            SET_TOP(res);
            if (res == NULL) {
                goto error;
            }
            JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP);
            DISPATCH();
        }

Again the marked line would be deleted.

It does seem like there would be a proliferation of new opcodes, since potentially each BINARY_OP_... specialization would also need multiple versions: we might want to avoid the refcount for the left arg, for the right arg, or for both args...

I'd worry that the resulting dilution of the interpreter code might outweigh the beneficial effect of skipping the refcounts. We'd definitely want to have some way of automatically deriving the borrowing versions from the standard versions (i.e., generating the interpreter code -- I'm not keen on more macros).

mkbosmans commented 2 years ago

Is there a bit available in the pointer on the stack to mark a skipped incref? That way, instead of doing Py_INCREF(value); in LOAD_FAST_BORROW you could do something like value |= 0x1;. Then BINARY_OP can check for that bit and skip the decref.

That way only LOAD_FAST needs separate variants, and there is still only a single version of each BINARY_OP. So there is no increase in code size and icache pressure, but that is traded for an extra branch for lhs and rhs.
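A minimal sketch of that decref-side check (helper name hypothetical, assuming the low bit is the tag set by LOAD_FAST_BORROW):

static inline void decref_unless_borrowed(PyObject *tagged)
{
    /* Stack entries with the low bit set are borrowed references that were
     * never incref'ed, so only plain (owned) references get released. */
    if (((uintptr_t)tagged & 1) == 0) {
        Py_DECREF(tagged);
    }
}

BINARY_OP would mask the bit off before using the pointers and call this instead of Py_DECREF for each operand; that is the extra branch per operand mentioned above.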

mkbosmans commented 2 years ago

Oh, I just read #444. I don't understand that description well enough, however, to tell whether what is being discussed there is the same as my idea here, or merely something related.

gvanrossum commented 2 years ago

Yeah, it's the same idea. Have a bit on each stack entry that indicates whether it's a borrowed reference.

lpereira commented 2 years ago

Just as a side note, which isn't that important to this discussion but is something to be aware of: if we use the upper bits to store the tag, rather than the lower bit, we might be able to use features such as Linear Address Masking, where the CPU can basically ignore those bits and we wouldn't need to unmask these pointers when accessing them.

lpereira commented 2 years ago

Half-baked idea, assuming the least-significant bit tags a borrowed reference: instead of using Py_INCREF() inside the interpreter, use something else:

static inline PyObject *untag(PyObject *ob)
{
    return (PyObject *)((uintptr_t)ob & ~1ull);
}

static inline PyObject *untag_maybe_ref(PyObject *ob)
{
     /* This untags a pointer and increments the reference count if it
      * was tagged as a borrowed reference, branch-free */
     PyObject *untagged_ob = untag(ob);
     untagged_ob->ob_refcnt += ((uintptr_t)ob) & 1ull;
     return untagged_ob;
}

static inline void maybe_unref(PyObject *ob, PyObject *untagged_ob)
{
    if (ob == untagged_ob)
        Py_DECREF(ob);
}

And then use it in the interpreter loop like this (I removed the asserts for clarity, but they're the same as the original code):

        TARGET(BINARY_OP) {
            PREDICTED(BINARY_OP);
            PyObject *rhs = TOP();
            PyObject *lhs = SECOND();
            PyObject *rhs_untagged = untag_maybe_ref(rhs);
            PyObject *lhs_untagged = untag_maybe_ref(lhs);
            PyObject *res = binary_ops[oparg](lhs_untagged, rhs_untagged);
            maybe_unref(lhs, lhs_untagged);
            maybe_unref(rhs, rhs_untagged);
            DROP(); /* #define DROP() --stack_pointer */
            SET_TOP(res);
            if (res == NULL) {
                goto error;
            }
            JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP);
            DISPATCH();
        }
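For completeness, the tag-setting side of this scheme could look something like the following sketch (hypothetical opcode, mirroring the LOAD_FAST code quoted earlier, but pushing a tagged pointer instead of taking a new reference):

        TARGET(LOAD_FAST_BORROW) {
            PyObject *value = GETLOCAL(oparg);
            assert(value != NULL);
            /* No Py_INCREF: the low bit marks this stack entry as a borrowed
             * reference that is still accounted to the local variable. */
            PUSH((PyObject *)((uintptr_t)value | 1ull));
            DISPATCH();
        }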

Advantages of this:

Disadvantages:

Other remarks:

markshannon commented 1 year ago

From the latest stats, we have:

Total increfs: 98 billion
Total decrefs: 108 billion
Total bytecode instructions executed: 92 billion, of which LOAD_FAST, LOAD_CONST and superinstructions of those: ~33 billion

There are another roughly 25 billion increfs that are attributed to the interpreter. It is not clear where those are, but it is possible that many of those could be deferred as well.

This suggests that using tagged pointers in the interpreter should be a win. We could potentially save ~50% of increfs (and the matching decrefs) at the cost of an additional branch when we would have decremented those deferred references.

There is also the potential for further improvements using static analysis. For those operations where we expect most operands to be non-deferred we can, in the compiler, ensure that all operands are non-deferred, avoiding the additional check.

gvanrossum commented 1 year ago

I've written up some thoughts in #444 on how we would actually do this.

markshannon commented 1 year ago

I'd like to elaborate on my ideas a bit, so that we are all on the same page.

Naming

Maybe "deferred reference" counting is the wrong term. In the academic literature, the term implies that the GC is responsible for tracking refcounts on the stack. That could work, but it does change the semantics of reclamation, probably more than we want to.

So let's call it "tagged references", since the implementation will be by tagging pointers. The exact scheme doesn't matter too much, but the idea is that some references are counted internally on the object (using ob_refcnt) and some are counted externally, on the references to the object, by tagging those references.

The difference is that the references themselves are not deferred; they are as immediate as always. In other words, all references are still references, but the accounting differs.

Invariants

With these invariants, we can de-allocate an object when its internal refcount reaches zero, as that implies the total refcount is also zero.

Enforcing invariants

To enforce the invariants we need to obey the following rule: For an externally accounted reference to exist, there must exist an internally accounted reference to the same object that will outlive it.

Stacks have the property that things lower down the stack remain on the stack for longer than things higher up. So, we can use the properties of the stack frame to determine which references can be externally accounted.

We can do static analysis in the compiler to determine which stack references are outlived by the local variable from which they come. Most will have this property, and we can replace the LOAD_FAST with a LOAD_FAST_TAGGED_REF. LOAD_CONST can always tag the reference, as the consts will outlast any value on the stack.

Tagged references can be transferred from caller to callee (so no additional checking is needed when passing arguments to a Python function), but cannot be stored in the heap, or passed (by return or yield) from callee to caller.

We will need a bit of extra code for when the locals are modified in the debugger, to keep the old values alive for the lifetime of the frame.

jneb commented 1 year ago

It would be useful to estimate how switching to a register-based architecture influences the efficiency gain of this. I cannot guess whether using registers makes this optimization more useful or less.

gvanrossum commented 1 year ago

It would probably be less effective with a register based architecture. Take

a = b + c

Currently this is

LOAD_FAST (b)
LOAD_FAST (c)
BINARY_OP (+)
STORE_FAST (a)

and there are INCREF calls in the two LOAD opcodes matched by DECREF calls in BINARY_OP.

With a register architecture this would be

BINARY_OP (+) (b) (c) (a)

and there would be no need for those INCREF and DECREF calls.
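A sketch of what such a register-form handler might look like (this is hypothetical: GETREG/SETREG would read and write frame slots directly, and the operand decoding is hand-waved):

        TARGET(BINARY_OP_REG) {
            /* Operands are read straight out of their frame slots, so no stack
             * copies of the references are created and there is nothing extra
             * to incref or decref. */
            PyObject *lhs = GETREG(src1);
            PyObject *rhs = GETREG(src2);
            PyObject *res = binary_ops[oparg](lhs, rhs);
            if (res == NULL) {
                goto error;
            }
            /* The destination slot takes ownership of res; SETREG releases
             * whatever reference the slot held before. */
            SETREG(dst, res);
            DISPATCH();
        }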

Alas, the dynamic version of tagged refcounts can probably be done with relatively few changes to the interpreter, whereas the register machine would take months to complete, months that we don't have in the 3.12 schedule (beta 1, the feature cut-off, is late May).

gvanrossum commented 1 year ago

Preliminary results of adding a whole lot of code (mostly automated) that strips tag bits when read from the stack (but no code that actually sets a tag bit):

I am pondering my next move.

lpereira commented 1 year ago
  • It's 2% slower in benchmarks

I would not worry about this regression at this point, since this experiment doesn't paint the whole picture. More work is being performed during this experiment, which is likely to negatively affect the benchmarks. I suspect that, in the long run, when the reference management overhead is reduced, this 2% will be of no consequence.

BTW, you mentioned in today's meeting about maybe needing more bits. It's fine to use up to 3 bits if the objects are allocated on an 8-byte alignment boundary. I wouldn't use the higher bits, as that's not very future-proof (e.g. x86_64 used to use only the lower 48 bits, but with 5-level paging, that's up to 57 bits at this point).

gvanrossum commented 1 year ago

BTW, you mentioned in today's meeting about maybe needing more bits.

In the end it seems we don't. The extra reference counts are distributed over potentially many "variables". Each variable has a bit (I'm using the low bit) indicating that it represents an extra reference. E.g. if we have

a = b + b

then this might translate into

LOAD_FAST (b)
LOAD_FAST (b)
BINARY_OP (+)
STORE_FAST (a)

At the start of BINARY_OP we have these locals and stack variables:

a: NULL
b: <object pointer B>
stack[0]: B|1
stack[1]: B|1

So the total number of references to B is 3: one from variable b that is internally counted (i.e. B->ob_refcnt == 1), one from stack[0], and one from stack[1]. But neither stack entry represents more than one reference.

The BINARY_OP implementation calls something like PyObject_Add(left, right), where it must strip the tag bits off. It can then just drop the stack entries without calling Py_DECREF() on them, because the tag bits are set. At that point the total number of references to B is 1 again.

A tricky issue is that PyObject_Add() might see B->ob_refcnt == 1 and modify the object in place, even though there are other references to B that ob_refcnt does not account for (possibly including a totally unrelated tagged reference elsewhere). So we need to be more careful with such optimizations (there are a few).
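Schematically, the kind of check that becomes unsafe is (a simplification for illustration, not the code of any particular optimization):

static int has_unique_reference(PyObject *v)
{
    /* Before tagging, ob_refcnt == 1 proved that the caller held the only
     * reference.  With tagged stack entries that are not counted in
     * ob_refcnt, this inference is no longer valid. */
    return Py_REFCNT(v) == 1;
}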

Another (more immediate) problem for me is that I have to make sure that if there is code like

return a

with corresponding bytecode

LOAD_FAST (a)
RETURN_VALUE

the RETURN_VALUE instruction must not just strip the tag bit from the pointer -- it must also increment ob_refcnt, since the object escapes from the frame. Similarly for bytecodes involved in implementing yield and await.
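A minimal sketch of that conversion (helper name hypothetical, assuming the low-bit tagging described above): strip the tag and, if it was set, take a real reference so the escaping value is properly counted:

static inline PyObject *own_before_escape(PyObject *ob)
{
    PyObject *untagged = (PyObject *)((uintptr_t)ob & ~1ull);
    if ((uintptr_t)ob & 1ull) {
        /* The stack entry was a borrowed (uncounted) reference; the value is
         * about to leave the frame, so it needs its own refcount. */
        Py_INCREF(untagged);
    }
    return untagged;
}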

There's also stuff (alluded to by Mark) around tracing, which we might have to fix by just going over all stack and local values and doing the same thing to each one: if there's a tag bit, strip it and call Py_INCREF() on it. This would slow down tracing. Since I'm just trying to get a prototype that I can benchmark, I'll consciously not worry about that yet (it's a distraction for something we might never even do in the end).

gvanrossum commented 1 year ago

Having worked on this for ~2 weeks in https://github.com/python/cpython/pull/102539 I am ready to give up. Getting this to work has been quite a slog: I'm stumped on why I am losing references to None (probably due to an untagging bug in one of the async or exception handling opcodes, see comment), I haven't started tagging in LOAD_FAST (it immediately crashes), and I've not done anything for the specializations of LOAD_ATTR and CALL, which are the biggest potential wins. IOW I don't expect to get this into a state where a benchmark can show improvement. So I'm abandoning this now.

markshannon commented 1 year ago

The possibility of NoGIL, or something similar, makes this worth reconsidering (and other approaches that reduce the number of reference counting operations).

Checking a bit in the pointer is race free and scales perfectly.

markshannon commented 7 months ago

Superseded by https://github.com/faster-cpython/ideas/issues/632