Deferred reference counts.

markshannon commented 6 months ago

CPython uses a combination of naive reference counting and a stack machine. This results in a lot of reference counting operations that have a major impact on the performance of the JIT and interpreter.

We can reduce the cost of reference counting hugely by using deferred reference counting.

With deferred reference counting, we do not explicitly count references from the active part of the frame stack (in both local variables and the evaluation stack). Instead, we only count those references during GC.

Something like 70% of reference count operations occur in the interpreter. This fraction will vary, but if we start moving code from C to Python for performance and maintainability reasons, it will likely increase.

In order to implement deferred reference counting, we will need new C types for the stack references, some work on this has already been done.

Reference counts

Currently we maintain an exact reference count for each object in that object's header. This is the counted reference count: counted RC == true RC With deferred reference counts some references will not be counted in the object's header, but will be deferred: counted RC + deferred RC == true RC Note that, since no reference count can be negative, the true reference count can only be zero when both the deferred RC and the counted RC are zero.

This mean that we need a way to keep a handle to all objects whose counted RC reaches zero, so that we can collect them when we know that all deferred RCs are zero. We will probably implement this using what is known as a Zero Count Table (ZCT). Any object whose counted RC reaches zero, or is assigned to a stack variable when created will be added to the ZCT. During GC, when all deferred RCs are zero, any objects in the ZCT with a RC of zero can be reclaimed.

Prompt reclamation.

One important advantage of reference counting is prompt reclamation; there is not a long pause between an object becoming unreachable and it getting reclaimed.

In practice, it is not prompt reclamation of objects that matters but prompt reclamation of resources, primarily memory. Rather than tracking the number of objects allocated to determine when GC should occur, we should track the total memory allocated. If we are allocating multi-megabyte objects, GC should occur very frequently.

The algorithm

Allocation

Upon allocation of an object:

Decrease the nursery size by the size of the object
If the nursery size drops to zero or below, then schedule GC
If the reference is to be deferred (such that the counted RC is zero) then add the new object to the ZCT

GC

A GC collection should do the following:

Make all deferred references immediate by scanning all stacks and any generator/coroutines with deferred references.
Deallocate all object in the ZCT table with a RC of zero.
Perform incremental GC as normal, removing any collected objects from the ZCT.
Set the nursery size back to maximum.

Details

Any thread that calls into C code or suspends execution must store the stack pointer so that the GC can scan the stack.
It must be possible to add objects to, and remove objects from, the ZCT very quickly.

Handling references

We will need to distinguish between heap references and stack references. Stack references to an object contribute to the deferred RC of that object. Heap references to an object contribute to the counted RC of that object.

For all code except the GC, we can use the C compiler to help us out. We can define two opaque structs for stack and heap references

/* Stack reference */
typedef struct _stackref  PySRef;

/* Heap reference */
typedef struct _heapref PyHRef;

Because we cannot cast a PyHRef to a PySRef or vice-versa, we must do through function calls (provided we are disciplined enough to not extract the bits directly).

The function calls can correctly adjust the reference counts, putting objects in the ZCT if necessary. In debug mode we can use a bit as a dynamic marker to double-check that we are doing things correctly.

All interpreter frames, including generator and coroutine frames, must contain all deferred RCs or all immediate RCs. Executing frames must contain deferred RCs. The frame will need a bit to indicate whether references are deferred or immediate.

Lowering the cost of stack scanning

At every GC pause, which might be frequent, we need to make sure that the deferred reference count of all objects is zero. To do that we need to scan the entirety of all stacks. That could be expensive. To avoid having to scan the entire stack, we treat the lower part (furthest way from the current frame) as heap references and the top part (including the current frame) as stack references. The current frame must always be deferred, so we insert a special frame between the heap and stack parts to convert when a RETURN or YIELD would otherwise start executing a frame of heap references.

As we can rely on the special frame to convert from heap to stack references lazily, we don't need to perform that conversion in the GC. The GC does need to convert stack to heap references, so at the end of a GC collection, all thread stacks will have a special frame as their current frame

Structures

In order to get the most help from the C compiler converting plain PyObject * pointers into references suitable for supporting deferred reference counting and tagged ints, we need to define some opaque structs:

We should use the dup/close model of HPy for references, as it allows automatic finding of refcount errors which will be a boon for development.

The API

inline PyObject *PyHRef_To_PyObject_New(PyHRef);

static inline PyObject *PyHRef_To_PyObject_Steal(PyHRef href) {
    PyObject *res = PyHRef_To_PyObject_New(href);
    PyHRef_Close(href);
}

inline PyHRef PyHRef_Dup(PyHRef);
inline void PyHRef_Close(PyHRef);

inline PyObject *PySRef_To_PyObject_New(PySRef);

static inline PyObject *PySRef_To_PyObject_Steal(PySRef sref) {
    PyObject *res = PySRef_To_PyObject_New(sref);
    PySRef_Close(sref);
}

inline PySRef PySRef_Clone(PySRef);
inline void PySRef_Close(PySRef);

inline PySRef PyObject_To_SRef_New(PyObject *);

inline PySRef PyObject_To_SRef_Steal(PyObject *obj) {
    PySRef res = PyObject_To_SRef_New(obj);
    Py_DECREF(obj);
    return res;
}

inline PyHRef PyObject_To_HRef_New(PyObject *);

inline PyHRef PyObject_To_HRef_Steal(PyObject *) {
    PyHRef res = PyObject_To_HRef_New(obj);
    Py_DECREF(obj);
    return res;
}

All the Steal variants are defined as above, but may be implemented more efficiently.

Generators and coroutines

Generators and coroutines are executable objects containing references to other objects, like frames, but they are not on the frame stack; they are on the heap.

Here are three ways we can handle this:

All references in the generator's frame are stack references. This keeps execution and the compiler simple as generator frames are no different to normal frame. We do, however, need to track all generators so that the GC can convert the references when needed. With many live coroutines, this could get expensive.
All references in the generator's frame are heap references. This requires us to duplicate all instructions, which would be infeasible, or have the compiler insert additional reference count instructions where needed. With reasonable static analysis, this could perform reasonably well, but adds a lot of complexity to the compiler
All local variables are heap references, but all evaluation stack entries are stack references. This requires a special version of STORE_FAST, but other instructions are unchanged. The compiler will need to either, make sure the stack is empty when suspended, or make sure that all values on the stack are also present in a (counted) local variable. The latter is probably more efficient.

Of these, I think 3. is the best. It is reasonably efficient, and can be implemented and tested prior to the rest of the work on deferred references.

Tagging

Since references on the stack can be either stack references or heap references, it will be relatively easy to get the type wrong. To help prevent that we can tag references. In theory no tags should be necessary, and we could just do this for debug builds. However, if we are to support tagged ints, we will need tagging anyway, so we might as well use it for all code.

For consistency with tagged ints, we want 0 to mean "not a heap reference", i.e. stack reference (or tagged int), leading to the following tagging scheme:

Tag	Meaning
00	Reserved
01	Stack reference pointer (value = ptr+1)
10	Reserved
11	Heap reference pointer (value = ptr-1)

With tagged ints, we get this scheme: Tag	Meaning
00	Unboxed int
01	Stack reference pointer (value = ptr+1)
10	Reserved
11	Heap reference pointer (value = ptr-1)

gvanrossum commented 6 months ago

Shouldn't we need to consider the ideas the Facebook team (e.g. @colesbury, @DinoV) has for deferred reference counting too?

Fidget-Spinner commented 6 months ago

I agree with everything and all the naming scheme except calling newref "dup" and decref "close". I don't think it's worth it to add additional mental overhead at the moment to CPython. CPython core devs know when it's right to decref, and when it's right to incref. The new names throw them off for something that currently have nearly exact same semantics as incref/decref for no additional clarity/benefit.

markshannon commented 6 months ago

Dup and close are not new and decref. See HPy handles and HPy debug mode for more information about the difference and why it is important.

Given we are changing all uses of references in the interpreter, it would be a shame to lose the opportunity to make this improvement.

I think it is a bit presumptive to speak for all core devs, but the number of refcounting bugs that have needed fixing over the years shows that at least some core devs (myself included) make mistakes with refcounting.

Fidget-Spinner commented 6 months ago

Dup and close are not new and decref. See HPy handles and HPy debug mode for more information about the difference and why it is important.

Which part of the docs is that? The docs mention:

On CPython ... HPy_Dup() and HPy_Close() are just aliases for Py_INCREF and Py_DECREF

gvanrossum commented 6 months ago

When this was discussed in person I had not read or understood the proposal. Having now read it through, I am very worried about this, in terms of how disturbing it will be to the rest of the CPython code base (object implementations, stdlib extension modules) as well as 3rd party extensions (not counting the Limited API).

IIUC every DECREF in the world will change its behavior so that if the refcount reaches zero the object is added to the ZCT, because we can never be sure that there aren't also stack references to an object. This feels like it will break too many cases where code makes (not clearly stated) assumptions about object/resource lifetimes. Many cases will be hard to find or not under our control (3rd party extensions).

Doing GC more frequently would increase the frequency of stop the world events, right? That sounds like it could be potentially detrimental to free-threading performance.

I wonder if one potential improvement would be if the ZCT entries were to be considered heap references, so that an object that only exists in the ZCT (plus potential stack references) would have a reference count of one, not zero. This would do nothing about lifetimes, but at least it would make it impossible to observe live objects with a zero refcount. For the rest of the algorithms around the ZCT it would make no difference.

Basically, DECREF(x) would become

if (x->ob_refcnt == 1) {
    ...add x to ZCT without changing its refcount...
} else {
    assert(x->ob_refcnt > 1);
    if (!_Py_IsImmortal(x)) {
        x->ob_refcnt -= 1;
    }
}

This does nothing to absolve my worries about lifetimes or free-threading performance, though.

markshannon commented 6 months ago

IIUC every DECREF in the world will change its behavior so that if the refcount reaches zero the object is added to the ZCT.

Yes, but that isn't necessarily a problem.

In the interpreter, almost all Py_DECREFs are removed
If a call escapes the interpreter, then we save and restore the stack pointer around the call. That way if the ZCT grows too large (and we can set the limit how we like) as a result of a Py_DECREF then all the unreachable objects will be freed.

Just to be clear, it is _Py_Dealloc that will be changed, not Py_DECREF.

This feels like it will break too many cases where code makes (not clearly stated) assumptions about object/resource lifetimes.

A program cannot depend on unreachable objects, as they are unreachable. So object lifetimes isn't an issue.

Resource lifetimes do matter, though. That is why we want to base the frequency of collections on the size of allocations, not their number. This equates memory with resources, so for other resources, like sockets, we might need to pretend that an object is much larger than it really is, in order to prompt rapid reclamation.

And don't forget that reference cycles already introduce arbitrary delays in freeing resources.

I wonder if one potential improvement would be if the ZCT entries were to be considered heap references, so that an object that only exists in the ZCT (plus potential stack references) would have a reference count of one, not zero.

That make sense. We would want to change Py_DECREF in that case, to avoid having to decrement the reference count, just increment it again to put it in the ZCT.

markshannon commented 6 months ago

Doing GC more frequently would increase the frequency of stop the world events, right? That sounds like it could be potentially detrimental to free-threading performance.

All the more reason to implement incremental collection for free-threading.

markshannon commented 6 months ago

A few examples of how removing the refcounts improves the JIT stencils:

_LOAD_FAST_0

From 7 to 3 instructions

Main:

// 0: 49 8b 45 48                   movq    0x48(%r13), %rax
// 4: 8b 08                         movl    (%rax), %ecx
// 6: ff c1                         incl    %ecx
// 8: 74 02                         je      0xc <_JIT_ENTRY+0xc>
// a: 89 08                         movl    %ecx, (%rax)
// c: 48 89 45 00                   movq    %rax, (%rbp)
// 10: 48 83 c5 08                  addq    $0x8, %rbp

Deferred RC:

// 0: 49 8b 45 48                   movq    0x48(%r13), %rax
// 4: 48 89 45 00                   movq    %rax, (%rbp)
// 8: 48 83 c5 08                   addq    $0x8, %rbp

_STORE_FAST_0:

From 17 to 3 instructions!

Main:

// 0: 50                            pushq   %rax
// 1: 48 8b 45 f8                   movq    -0x8(%rbp), %rax
// 5: 48 83 c5 f8                   addq    $-0x8, %rbp
// 9: 49 8b 7d 48                   movq    0x48(%r13), %rdi
// d: 49 89 45 48                   movq    %rax, 0x48(%r13)
// 11: 48 85 ff                     testq   %rdi, %rdi
// 14: 74 0f                        je      0x25 <_JIT_ENTRY+0x25>
// 16: 48 8b 07                     movq    (%rdi), %rax
// 19: 85 c0                        testl   %eax, %eax
// 1b: 78 08                        js      0x25 <_JIT_ENTRY+0x25>
// 1d: 48 ff c8                     decq    %rax
// 20: 48 89 07                     movq    %rax, (%rdi)
// 23: 74 07                        je      0x2c <_JIT_ENTRY+0x2c>
// 25: 58                           popq    %rax
// 26: ff 25 00 00 00 00            jmpq    *(%rip)                 # 0x2c
// 2c: ff 15 00 00 00 00            callq   *(%rip)                 # 0x32
// 32: 58                           popq    %rax

Deferred RC:

// 0: 48 8b 45 f8                   movq    -0x8(%rbp), %rax
// 4: 48 83 c5 f8                   addq    $-0x8, %rbp
// 8: 49 89 45 48                   movq    %rax, 0x48(%r13)

_LOAD_ATTR_INSTANCE_VALUE_0:

From 24 to 6 instructions

Main:

// 0: 50                            pushq   %rax
// 1: 48 8b 7d f8                   movq    -0x8(%rbp), %rdi
// 5: 0f b7 05 00 00 00 00          movzwl  (%rip), %eax            # 0xc
// c: 48 8b 5c c7 18                movq    0x18(%rdi,%rax,8), %rbx
// 11: 48 85 db                     testq   %rbx, %rbx
// 14: 74 22                        je      0x38 <_JIT_ENTRY+0x38>
// 16: 8b 03                        movl    (%rbx), %eax
// 18: ff c0                        incl    %eax
// 1a: 74 02                        je      0x1e <_JIT_ENTRY+0x1e>
// 1c: 89 03                        movl    %eax, (%rbx)
// 1e: 48 8b 07                     movq    (%rdi), %rax
// 21: 85 c0                        testl   %eax, %eax
// 23: 78 08                        js      0x2d <_JIT_ENTRY+0x2d>
// 25: 48 ff c8                     decq    %rax
// 28: 48 89 07                     movq    %rax, (%rdi)
// 2b: 74 12                        je      0x3f <_JIT_ENTRY+0x3f>
// 2d: 48 89 5d f8                  movq    %rbx, -0x8(%rbp)
// 31: 58                           popq    %rax
// 32: ff 25 00 00 00 00            jmpq    *(%rip)                 # 0x38
// 38: 58                           popq    %rax
// 39: ff 25 00 00 00 00            jmpq    *(%rip)                 # 0x3f
// 3f: ff 15 00 00 00 00            callq   *(%rip)                 # 0x45
// 45: 48 89 5d f8                  movq    %rbx, -0x8(%rbp)
// 49: 58                           popq    %rax

Deferred RC:

// 0: 48 8b 45 f8                   movq    -0x8(%rbp), %rax
// 4: 0f b7 0d 00 00 00 00          movzwl  (%rip), %ecx            # 0xb
// b: 48 8b 44 c8 18                movq    0x18(%rax,%rcx,8), %rax
// 10: 48 85 c0                     testq   %rax, %rax
// 13: 74 0a                        je      0x1f <_JIT_ENTRY+0x1f>
// 15: 48 89 45 f8                  movq    %rax, -0x8(%rbp)

Fidget-Spinner commented 6 months ago

Wait where are the instructions that test for the tag in the above JIT stencils? Or are you suggesting removing all decref/incref in these instructions? Or maybe these just for illustrative purposes?

The alternative route we can do is refcount analysis like what Cinder does in their JIT. We can create stencil variants automatically that do not decref/incref inputs. For example, if something has a refcount of 1, we don't incref it again, and similarly don't decref it until there's something that makes it 0.

Fidget-Spinner commented 6 months ago

Wait where are the instructions that test for the tag in the above JIT stencils? Or are you suggesting removing all decref/incref in these instructions?

Nevermind, after reading your scheme more closely, you're not using the tag for deferred stuff.

iritkatriel commented 6 months ago

Fourth option for Generators and coroutines:

The stack of a suspended frame owns references to anything on it (so YIELD increfs everything on the stack and on resume we decref).

brandtbucher commented 6 months ago

A few examples of how removing the refcounts improves the JIT stencils:

And combined with stack caching, _LOAD_FAST_0 and _STORE_FAST_0 will each become a single instruction, and _LOAD_ATTR_INSTANCE_VALUE_0 will become only four. That's exciting.

markshannon commented 5 months ago

A couple more points:

Until we need/want tagged integers, there is no need for any tags at all. Converting from a stack refs to a PyObject * has no runtime cost.
Deferring all stack references likely reduces the number of contended reference count operations by an even larger fraction that total reference count operations, meaning that immortalization becomes less necessary for performance.

zhuguangxiang commented 4 days ago

when python call c code, how and where to save object pointers which are defined in c code?

zhuguangxiang commented 4 days ago

Another question is how to handle deferred RC in multi-thread case without GIL? I think we need to consider more about RC or DRC and GIL together.

faster-cpython / ideas