The cycle GC takes about 10% of the runtime on our benchmark suite.
This is already far too high, and with NoGIL it will get much worse.
Currently, we perform an entire heap scan every N collections (currently N == 100).
Scanning the entire heap is bad for throughput and even worse for latency.
So instead of scanning the entire heap at once, we should scan the old generation incrementally.
Here's how it could work (this is untested and unproven):
- Two generations: `young` and `old`.
- Split the old generation into two spaces: `pending` and `scanned`.
- Perform alternating young and old collections.[*]
- Survivors of the young collection are moved to the end of the `scanned` section.

The old collection works as follows:
```python
objects_per_collection = 10_000  # This will need to adapt to the rate at which objects survive the young collection
if not pending:
    pending, scanned = scanned, pending
    if not pending:
        return
work = [pending.popleft()]
objects_per_collection -= 1
while True:
    for obj in work:  # Do not check for mutation, as we want to mutate work.
        for child in obj.tp_visit:
            if child in scanned:
                continue
            DECREF(child)
            if child not in work:
                work.append(child)
                objects_per_collection -= 1
    if not pending or objects_per_collection <= 0:
        break
    work.append(pending.popleft())
    objects_per_collection -= 1
reachable = clear_cycles_and_restore_refcounts(work)
scanned.extend(reachable)
```
`clear_cycles_and_restore_refcounts(work)` is already part of the cycle GC, so we aren't really adding much complexity.
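As a sanity check, the scan loop above can be modeled with ordinary deques. This is an illustrative toy only, not CPython code: `Obj.children` stands in for `tp_visit`, the refcount bookkeeping (`DECREF`, restoring refcounts) is omitted, and the clear-cycles step is replaced by simply keeping every visited object.

```python
from collections import deque

class Obj:
    """Toy heap object; `children` stands in for tp_visit."""
    def __init__(self):
        self.children = []

def incremental_collect(pending, scanned, budget):
    """Scan roughly `budget` objects from `pending`, moving everything
    transitively reached into `scanned`. Returns the updated spaces."""
    if not pending:
        pending, scanned = scanned, pending  # swap spaces, start a new cycle
        if not pending:
            return pending, scanned
    work = [pending.popleft()]
    budget -= 1
    while True:
        for obj in work:  # `work` grows while we iterate, deliberately
            for child in obj.children:
                if child in scanned or child in work:
                    continue
                if child in pending:
                    pending.remove(child)  # it is being scanned in this pass
                work.append(child)
                budget -= 1
        if not pending or budget <= 0:
            break
        work.append(pending.popleft())
        budget -= 1
    scanned.extend(work)  # a real GC would clear unreachable cycles first
    return pending, scanned

# Demo: a -> b -> c, all three start in the pending space.
a, b, c = Obj(), Obj(), Obj()
a.children.append(b)
b.children.append(c)
pending, scanned = incremental_collect(deque([a, b, c]), deque(), budget=2)
```

Note how the budget is a soft limit: once `a` is picked up, its transitive children are pulled into the same pass even though that overshoots `budget`, which is exactly the worst-case behaviour described below.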
The above algorithm has (I believe) two key properties:

- It is guaranteed to collect all cycles eventually.
- It visits no more objects than the current algorithm: in the worst case, where the entire object graph is reachable from the first object looked at, all objects are visited, but the current old-space collection does that anyway.
[*] Instead of doing an incremental collection every 2 collections, we could do it every N collections. If the incremental collections visit too many objects, i.e. much more than `work_per_collection`, we can increase N.
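The footnote's tuning rule might look something like the sketch below. The doubling factor, the "2x the budget" threshold for "too many objects", and the cap of 100 (matching the current full-heap interval) are all illustrative assumptions, not tuned values.

```python
def adjust_interval(n, objects_visited, work_per_collection):
    """Return the new number of collections between incremental old passes.

    Back off (scan the old space less often) when the last incremental
    pass visited far more objects than its work budget.
    """
    if objects_visited > 2 * work_per_collection:  # "much more than" the budget
        return min(n * 2, 100)  # never less often than the current N == 100
    return n

# e.g. a pass that visited 50k objects against a 10k budget doubles N.
```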