Open markshannon opened 6 months ago
Shouldn't we need to consider the ideas the Facebook team (e.g. @colesbury, @DinoV) has for deferred reference counting too?
I agree with everything and all the naming scheme except calling newref "dup" and decref "close". I don't think it's worth it to add additional mental overhead at the moment to CPython. CPython core devs know when it's right to decref, and when it's right to incref. The new names throw them off for something that currently have nearly exact same semantics as incref/decref for no additional clarity/benefit.
Dup and close are not new and decref. See HPy handles and HPy debug mode for more information about the difference and why it is important.
Given we are changing all uses of references in the interpreter, it would be a shame to lose the opportunity to make this improvement.
I think it is a bit presumptive to speak for all core devs, but the number of refcounting bugs that have needed fixing over the years shows that at least some core devs (myself included) make mistakes with refcounting.
Dup and close are not new and decref. See HPy handles and HPy debug mode for more information about the difference and why it is important.
Which part of the docs is that? The docs mention:
On CPython ... HPy_Dup() and HPy_Close() are just aliases for Py_INCREF and Py_DECREF
When this was discussed in person I had not read or understood the proposal. Having now read it through, I am very worried about this, in terms of how disturbing it will be to the rest of the CPython code base (object implementations, stdlib extension modules) as well as 3rd party extensions (not counting the Limited API).
IIUC every DECREF in the world will change its behavior so that if the refcount reaches zero the object is added to the ZCT, because we can never be sure that there aren't also stack references to an object. This feels like it will break too many cases where code makes (not clearly stated) assumptions about object/resource lifetimes. Many cases will be hard to find or not under our control (3rd party extensions).
Doing GC more frequently would increase the frequency of stop the world events, right? That sounds like it could be potentially detrimental to free-threading performance.
I wonder if one potential improvement would be if the ZCT entries were to be considered heap references, so that an object that only exists in the ZCT (plus potential stack references) would have a reference count of one, not zero. This would do nothing about lifetimes, but at least it would make it impossible to observe live objects with a zero refcount. For the rest of the algorithms around the ZCT it would make no difference.
Basically, DECREF(x) would become
if (x->ob_refcnt == 1) {
...add x to ZCT without changing its refcount...
} else {
assert(x->ob_refcnt > 1);
if (!_Py_IsImmortal(x)) {
x->ob_refcnt -= 1;
}
}
This does nothing to absolve my worries about lifetimes or free-threading performance, though.
IIUC every DECREF in the world will change its behavior so that if the refcount reaches zero the object is added to the ZCT.
Yes, but that isn't necessarily a problem.
Py_DECREF
s are removedIf a call escapes the interpreter, then we save and restore the stack pointer around the call. That way if the ZCT grows too large (and we can set the limit how we like) as a result of a Py_DECREF
then all the unreachable objects will be freed.
Just to be clear, it is _Py_Dealloc
that will be changed, not Py_DECREF
.
This feels like it will break too many cases where code makes (not clearly stated) assumptions about object/resource lifetimes.
A program cannot depend on unreachable objects, as they are unreachable. So object lifetimes isn't an issue.
Resource lifetimes do matter, though. That is why we want to base the frequency of collections on the size of allocations, not their number. This equates memory with resources, so for other resources, like sockets, we might need to pretend that an object is much larger than it really is, in order to prompt rapid reclamation.
And don't forget that reference cycles already introduce arbitrary delays in freeing resources.
I wonder if one potential improvement would be if the ZCT entries were to be considered heap references, so that an object that only exists in the ZCT (plus potential stack references) would have a reference count of one, not zero.
That make sense. We would want to change Py_DECREF
in that case, to avoid having to decrement the reference count, just increment it again to put it in the ZCT.
Doing GC more frequently would increase the frequency of stop the world events, right? That sounds like it could be potentially detrimental to free-threading performance.
All the more reason to implement incremental collection for free-threading.
A few examples of how removing the refcounts improves the JIT stencils:
From 7 to 3 instructions
Main:
// 0: 49 8b 45 48 movq 0x48(%r13), %rax
// 4: 8b 08 movl (%rax), %ecx
// 6: ff c1 incl %ecx
// 8: 74 02 je 0xc <_JIT_ENTRY+0xc>
// a: 89 08 movl %ecx, (%rax)
// c: 48 89 45 00 movq %rax, (%rbp)
// 10: 48 83 c5 08 addq $0x8, %rbp
Deferred RC:
// 0: 49 8b 45 48 movq 0x48(%r13), %rax
// 4: 48 89 45 00 movq %rax, (%rbp)
// 8: 48 83 c5 08 addq $0x8, %rbp
From 17 to 3 instructions!
Main:
// 0: 50 pushq %rax
// 1: 48 8b 45 f8 movq -0x8(%rbp), %rax
// 5: 48 83 c5 f8 addq $-0x8, %rbp
// 9: 49 8b 7d 48 movq 0x48(%r13), %rdi
// d: 49 89 45 48 movq %rax, 0x48(%r13)
// 11: 48 85 ff testq %rdi, %rdi
// 14: 74 0f je 0x25 <_JIT_ENTRY+0x25>
// 16: 48 8b 07 movq (%rdi), %rax
// 19: 85 c0 testl %eax, %eax
// 1b: 78 08 js 0x25 <_JIT_ENTRY+0x25>
// 1d: 48 ff c8 decq %rax
// 20: 48 89 07 movq %rax, (%rdi)
// 23: 74 07 je 0x2c <_JIT_ENTRY+0x2c>
// 25: 58 popq %rax
// 26: ff 25 00 00 00 00 jmpq *(%rip) # 0x2c
// 2c: ff 15 00 00 00 00 callq *(%rip) # 0x32
// 32: 58 popq %rax
Deferred RC:
// 0: 48 8b 45 f8 movq -0x8(%rbp), %rax
// 4: 48 83 c5 f8 addq $-0x8, %rbp
// 8: 49 89 45 48 movq %rax, 0x48(%r13)
From 24 to 6 instructions
Main:
// 0: 50 pushq %rax
// 1: 48 8b 7d f8 movq -0x8(%rbp), %rdi
// 5: 0f b7 05 00 00 00 00 movzwl (%rip), %eax # 0xc
// c: 48 8b 5c c7 18 movq 0x18(%rdi,%rax,8), %rbx
// 11: 48 85 db testq %rbx, %rbx
// 14: 74 22 je 0x38 <_JIT_ENTRY+0x38>
// 16: 8b 03 movl (%rbx), %eax
// 18: ff c0 incl %eax
// 1a: 74 02 je 0x1e <_JIT_ENTRY+0x1e>
// 1c: 89 03 movl %eax, (%rbx)
// 1e: 48 8b 07 movq (%rdi), %rax
// 21: 85 c0 testl %eax, %eax
// 23: 78 08 js 0x2d <_JIT_ENTRY+0x2d>
// 25: 48 ff c8 decq %rax
// 28: 48 89 07 movq %rax, (%rdi)
// 2b: 74 12 je 0x3f <_JIT_ENTRY+0x3f>
// 2d: 48 89 5d f8 movq %rbx, -0x8(%rbp)
// 31: 58 popq %rax
// 32: ff 25 00 00 00 00 jmpq *(%rip) # 0x38
// 38: 58 popq %rax
// 39: ff 25 00 00 00 00 jmpq *(%rip) # 0x3f
// 3f: ff 15 00 00 00 00 callq *(%rip) # 0x45
// 45: 48 89 5d f8 movq %rbx, -0x8(%rbp)
// 49: 58 popq %rax
Deferred RC:
// 0: 48 8b 45 f8 movq -0x8(%rbp), %rax
// 4: 0f b7 0d 00 00 00 00 movzwl (%rip), %ecx # 0xb
// b: 48 8b 44 c8 18 movq 0x18(%rax,%rcx,8), %rax
// 10: 48 85 c0 testq %rax, %rax
// 13: 74 0a je 0x1f <_JIT_ENTRY+0x1f>
// 15: 48 89 45 f8 movq %rax, -0x8(%rbp)
Wait where are the instructions that test for the tag in the above JIT stencils? Or are you suggesting removing all decref/incref in these instructions? Or maybe these just for illustrative purposes?
The alternative route we can do is refcount analysis like what Cinder does in their JIT. We can create stencil variants automatically that do not decref/incref inputs. For example, if something has a refcount of 1, we don't incref it again, and similarly don't decref it until there's something that makes it 0.
Wait where are the instructions that test for the tag in the above JIT stencils? Or are you suggesting removing all decref/incref in these instructions?
Nevermind, after reading your scheme more closely, you're not using the tag for deferred stuff.
Fourth option for Generators and coroutines:
A few examples of how removing the refcounts improves the JIT stencils:
And combined with stack caching, _LOAD_FAST_0
and _STORE_FAST_0
will each become a single instruction, and _LOAD_ATTR_INSTANCE_VALUE_0
will become only four. That's exciting.
A couple more points:
PyObject *
has no runtime cost.when python call c code, how and where to save object pointers which are defined in c code?
Another question is how to handle deferred RC in multi-thread case without GIL? I think we need to consider more about RC or DRC and GIL together.
CPython uses a combination of naive reference counting and a stack machine. This results in a lot of reference counting operations that have a major impact on the performance of the JIT and interpreter.
We can reduce the cost of reference counting hugely by using deferred reference counting.
With deferred reference counting, we do not explicitly count references from the active part of the frame stack (in both local variables and the evaluation stack). Instead, we only count those references during GC.
Something like 70% of reference count operations occur in the interpreter. This fraction will vary, but if we start moving code from C to Python for performance and maintainability reasons, it will likely increase.
In order to implement deferred reference counting, we will need new C types for the stack references, some work on this has already been done.
Reference counts
Currently we maintain an exact reference count for each object in that object's header. This is the counted reference count:
counted RC == true RC
With deferred reference counts some references will not be counted in the object's header, but will be deferred:counted RC + deferred RC == true RC
Note that, since no reference count can be negative, the true reference count can only be zero when both the deferred RC and the counted RC are zero.This mean that we need a way to keep a handle to all objects whose counted RC reaches zero, so that we can collect them when we know that all deferred RCs are zero. We will probably implement this using what is known as a Zero Count Table (ZCT). Any object whose counted RC reaches zero, or is assigned to a stack variable when created will be added to the ZCT. During GC, when all deferred RCs are zero, any objects in the ZCT with a RC of zero can be reclaimed.
Prompt reclamation.
One important advantage of reference counting is prompt reclamation; there is not a long pause between an object becoming unreachable and it getting reclaimed.
In practice, it is not prompt reclamation of objects that matters but prompt reclamation of resources, primarily memory. Rather than tracking the number of objects allocated to determine when GC should occur, we should track the total memory allocated. If we are allocating multi-megabyte objects, GC should occur very frequently.
The algorithm
Allocation
Upon allocation of an object:
GC
A GC collection should do the following:
Details
Handling references
We will need to distinguish between heap references and stack references. Stack references to an object contribute to the deferred RC of that object. Heap references to an object contribute to the counted RC of that object.
For all code except the GC, we can use the C compiler to help us out. We can define two opaque structs for stack and heap references
Because we cannot cast a
PyHRef
to aPySRef
or vice-versa, we must do through function calls (provided we are disciplined enough to not extract the bits directly).The function calls can correctly adjust the reference counts, putting objects in the ZCT if necessary. In debug mode we can use a bit as a dynamic marker to double-check that we are doing things correctly.
All interpreter frames, including generator and coroutine frames, must contain all deferred RCs or all immediate RCs. Executing frames must contain deferred RCs. The frame will need a bit to indicate whether references are deferred or immediate.
Lowering the cost of stack scanning
At every GC pause, which might be frequent, we need to make sure that the deferred reference count of all objects is zero. To do that we need to scan the entirety of all stacks. That could be expensive. To avoid having to scan the entire stack, we treat the lower part (furthest way from the current frame) as heap references and the top part (including the current frame) as stack references. The current frame must always be deferred, so we insert a special frame between the heap and stack parts to convert when a
RETURN
orYIELD
would otherwise start executing a frame of heap references.As we can rely on the special frame to convert from heap to stack references lazily, we don't need to perform that conversion in the GC. The GC does need to convert stack to heap references, so at the end of a GC collection, all thread stacks will have a special frame as their current frame
Structures
In order to get the most help from the C compiler converting plain
PyObject *
pointers into references suitable for supporting deferred reference counting and tagged ints, we need to define some opaque structs:We should use the dup/close model of HPy for references, as it allows automatic finding of refcount errors which will be a boon for development.
The API
All the
Steal
variants are defined as above, but may be implemented more efficiently.Generators and coroutines
Generators and coroutines are executable objects containing references to other objects, like frames, but they are not on the frame stack; they are on the heap.
Here are three ways we can handle this:
STORE_FAST
, but other instructions are unchanged. The compiler will need to either, make sure the stack is empty when suspended, or make sure that all values on the stack are also present in a (counted) local variable. The latter is probably more efficient.Of these, I think 3. is the best. It is reasonably efficient, and can be implemented and tested prior to the rest of the work on deferred references.
Tagging
Since references on the stack can be either stack references or heap references, it will be relatively easy to get the type wrong. To help prevent that we can tag references. In theory no tags should be necessary, and we could just do this for debug builds. However, if we are to support tagged ints, we will need tagging anyway, so we might as well use it for all code.
For consistency with tagged ints, we want 0 to mean "not a heap reference", i.e. stack reference (or tagged int), leading to the following tagging scheme: