faster-cpython / ideas


Fewer collections by automatically adjusting the GC threshold #523

Open penguin-wwy opened 1 year ago

penguin-wwy commented 1 year ago

Initially, the idea came from Optimal heap limits for reducing browser memory use.

If cyclic garbage essentially never occurs at execution time, reference counting alone can free all the memory. In that case we want the threshold to increase, so that fewer GC runs happen.

Conversely, if most deallocations are done by the GC, the threshold should be lowered to reduce the number of objects the GC has to traverse on each pass.

After discussing it with the authors of the paper, we modified the paper's formula to fit Python's cyclic garbage collection:

M = L + sqrt(L * C)
M: Heap limit
L: Live memory
C: Constant

The variable M can be read as the total number of GC-tracked objects, and L as the number of objects that survived the last collection. The expression M - L then represents the number of objects allowed to accumulate between collections, i.e. the generation 0 threshold.

gcstate->generations[0].threshold = sqrt(L) * C

Finally, to reduce how often the threshold is adjusted, we use long_lived_pending instead of the exact L. In other words, each completed generations[1] collection is treated as a phase at which the threshold is re-tuned.

The final formula is gcstate->generations[0].threshold = sqrt(gcstate->long_lived_pending) * C. The constant C needs to be tuned empirically; in my experiments 11 worked best.
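For illustration, here is a rough user-level sketch of that rule (the actual change lives in CPython's gcmodule.c, see the branch linked below). It is not the patch itself: the interpreter's internal long_lived_pending counter is approximated here by the size of the oldest generation, and the floor of 700 (the default threshold) is my own assumption.

```python
import gc
import math

C = 11  # the constant arrived at experimentally above

def _retune(phase, info):
    # Treat each completed generation-1 collection as a tuning phase,
    # mirroring the long_lived_pending-based adjustment described above.
    if phase != "stop" or info["generation"] != 1:
        return
    # Crude stand-in for L: how many tracked objects live in the oldest
    # generation (the real patch reads gcstate->long_lived_pending instead).
    survivors = len(gc.get_objects(generation=2))
    new_gen0 = max(700, int(math.sqrt(survivors)) * C)  # sqrt(L) * C
    _, gen1, gen2 = gc.get_threshold()
    gc.set_threshold(new_gen0, gen1, gen2)

gc.callbacks.append(_retune)
```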

# pyperformance==1.0.6
python3 -m pyperf compare_to --table --table-format md --min-speed 2 3.12.json gc.json
| Benchmark | 3.12 | gc |
|-----------|------|----|
| async_tree_io | 1.43 sec | 780 ms: 1.84x faster |
| async_tree_memoization | 779 ms | 480 ms: 1.62x faster |
| async_tree_none | 641 ms | 411 ms: 1.56x faster |
| async_tree_cpu_io_mixed | 908 ms | 649 ms: 1.40x faster |
| float | 82.0 ms | 77.0 ms: 1.07x faster |
| xml_etree_generate | 89.6 ms | 85.1 ms: 1.05x faster |
| spectral_norm | 115 ms | 111 ms: 1.04x faster |
| html5lib | 74.9 ms | 72.3 ms: 1.04x faster |
| docutils | 2.66 sec | 2.57 sec: 1.04x faster |
| pickle | 13.1 us | 12.7 us: 1.03x faster |
| xml_etree_process | 61.2 ms | 59.6 ms: 1.03x faster |
| unpickle | 15.2 us | 14.8 us: 1.03x faster |
| deepcopy_reduce | 3.33 us | 3.24 us: 1.03x faster |
| regex_dna | 203 ms | 199 ms: 1.02x faster |
| json_dumps | 9.87 ms | 9.65 ms: 1.02x faster |
| chaos | 81.7 ms | 83.4 ms: 1.02x slower |
| fannkuch | 436 ms | 446 ms: 1.02x slower |
| raytrace | 328 ms | 336 ms: 1.02x slower |
| coroutines | 31.1 ms | 32.0 ms: 1.03x slower |
| richards | 48.8 ms | 50.3 ms: 1.03x slower |
| chameleon | 7.40 ms | 7.70 ms: 1.04x slower |
| scimark_fft | 365 ms | 381 ms: 1.04x slower |
| mdp | 3.17 sec | 3.33 sec: 1.05x slower |
| crypto_pyaes | 82.8 ms | 88.0 ms: 1.06x slower |
| json_loads | 27.0 us | 28.7 us: 1.06x slower |
| mako | 9.74 ms | 10.5 ms: 1.08x slower |
| deepcopy_memo | 34.6 us | 38.5 us: 1.11x slower |
| logging_silent | 93.0 ns | 111 ns: 1.19x slower |
| Geometric mean | (ref) | 1.02x faster |

In pyperformance, the async benchmarks are the most rewarding case, which is to be expected, since asynchronous code creates a lot of generator objects.

Also, benchmarks with the 'serialize' tag get a general boost, which may be related to the fact that they allocate more objects:

| Benchmark | 3.12 | gc |
|-----------|------|----|
| xml_etree_generate | 89.6 ms | 85.1 ms: 1.05x faster |
| pickle | 13.1 us | 12.7 us: 1.03x faster |
| xml_etree_process | 61.2 ms | 59.6 ms: 1.03x faster |
| unpickle | 15.2 us | 14.8 us: 1.03x faster |
| json_dumps | 9.87 ms | 9.65 ms: 1.02x faster |
| json_loads | 27.0 us | 28.7 us: 1.06x slower |
| Geometric mean | (ref) | 1.01x faster |

Most of the other changes are under 2%, which I think is an acceptable trade-off.

https://github.com/python/cpython/compare/main...penguin-wwy:cpython:main_gc_test2

corona10 commented 1 year ago

cc @pablogsal

pablogsal commented 1 year ago

Hi @penguin-wwy. Limiting GC collections is something I am currently working on, so thanks for mentioning your approach here. It is similar in spirit to what we currently do to avoid quadratic behaviour in the last generation.

This approach is also similar to one of the things I am benchmarking at the moment: dynamic thresholds. The idea is the same, but what drives the generation thresholds is different: I am currently trying other signals, such as the ratio of new objects to dead objects, the rate of GC calls over time, and a different formula relating new objects to current objects (similar to the one used for the old generation).

In general, I think this is something that should be applied to all generations, not only the first one.

The other problem I am trying to fight is avoiding a ton of collections when a huge number of objects is created in a short span of time. That is where checking the GC rate over time can prove useful, because the rest of the metrics I can think of fail in one case or the other.
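As a minimal user-level illustration of the "rate in time" idea (not the actual experiment being benchmarked here; the 50 ms interval and the explicit maybe_collect() hook are assumptions for the sketch): take over scheduling of the young collection and refuse to collect more often than some minimum interval, however fast objects are being created.

```python
import gc
import time

MIN_INTERVAL = 0.05   # assumed: at most one young collection every 50 ms
PENDING_LIMIT = 700   # mirrors the default generation-0 threshold

_last_collect = time.monotonic()
gc.disable()          # take over scheduling of collections ourselves

def maybe_collect():
    """Call from convenient points in the application (e.g. an event loop tick)."""
    global _last_collect
    now = time.monotonic()
    if gc.get_count()[0] >= PENDING_LIMIT and now - _last_collect >= MIN_INTERVAL:
        gc.collect(0)
        _last_collect = now
```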

Finally, the last approach I am benchmarking is having no generations at all. It has been suggested that Python doesn't follow the weak generational hypothesis because most objects die via reference counting, so in theory a single generation with the current heuristic should be more performant.

Of course measuring any of this is very difficult because the GC is very idiosyncratic and depends heavily on the program you are currently executing.

pablogsal commented 1 year ago

Regarding your formula, there are several things I am not sure are being applied correctly.

But the formula you are using doesn't measure any of these quantities. You are using the number of objects that come into the GC, but that doesn't measure the ratio of objects cleaned by refcount to objects cleaned by the GC; it measures how many objects survived a GC pass. It says nothing about how many of those surviving objects will later be cleaned by refcount or by the GC, and it also doesn't take into account that there are tons of objects that are not tracked by the GC at all.
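As a rough illustration of what such a measurement could look like from user code (not something the proposal does): a callback that compares, per collection, how many objects accumulated in the young generation against how many the cyclic GC actually reclaimed. Note that it only ever sees GC-tracked objects, which is exactly the blind spot mentioned above.

```python
import gc

_entered = 0

def _ratio_probe(phase, info):
    global _entered
    if phase == "start":
        # Objects that accumulated in generation 0 since the last collection.
        _entered = gc.get_count()[0]
    elif phase == "stop" and _entered:
        freed = info["collected"]
        print(f"gen {info['generation']}: GC freed {freed} of ~{_entered} "
              f"newly tracked objects ({100 * freed / _entered:.1f}%)")

gc.callbacks.append(_ratio_probe)
```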

The paper goes into detail on how g, s and L should be measured in section 5:

> Our proposed approach, which we call MemBalancer, therefore measures 𝐿 and 𝑠 on the main thread after every major garbage collection, recording 𝐿 and 𝑠 and recomputing the heap limit 𝑀. Major garbage collections also measure and update 𝑔 to avoid measuring 𝑔 across major GC boundaries. As a result, heap limits are updated at least once per second, or more often if garbage collections are frequent.

They also say:

> MemBalancer therefore smooths the new allocation data using an exponentially weighted moving average, which estimates a value as a weighted linear combination of the current value and the last estimate. For 𝑔, a smoothing constant of 𝛼𝑔 = 0.95 is used, so that the half-life of any data point is about thirteen seconds, which reduces jitter enough to prevent spurious garbage collections but still allows the rate 𝑔 to respond rapidly to changes in program behavior. Similarly, the measured 𝑠 is smoothed, with a smoothing constant of 𝛼𝑠 = 0.5, to account for anomalously long garbage collections. Both 𝑔 and 𝑠 are measured by dividing a number of bytes by a duration; as is standard, we smooth the top and bottom individually.
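The smoothing described there is just an exponentially weighted moving average with the numerator (bytes) and denominator (seconds) smoothed separately; a minimal sketch using the constants quoted above:

```python
class RateEWMA:
    """Smooth a rate (bytes / seconds), smoothing top and bottom individually."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.num = None   # smoothed byte count
        self.den = None   # smoothed duration

    def update(self, nbytes, seconds):
        if self.num is None:
            self.num, self.den = nbytes, seconds
        else:
            self.num = self.alpha * self.num + (1 - self.alpha) * nbytes
            self.den = self.alpha * self.den + (1 - self.alpha) * seconds
        return self.num / self.den   # the smoothed rate

g_rate = RateEWMA(alpha=0.95)  # allocation rate g
s_rate = RateEWMA(alpha=0.5)   # collection speed s
```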

But somehow all of that is ignored and the final formula is sqrt(long_lived_pending) * 11?

New gc threshold: 891
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 2.359882
Sucess ratio: 0.728450
Sucess ratio: 7.075471
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 2.097902
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 1265
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 22.963537
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 1771
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 2409
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 3069
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 3740
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 4422
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 5115
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 5819
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 6523
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 7238
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 3289
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 3960
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 4653
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 6.827382
Sucess ratio: 49.068794
Sucess ratio: 32.848980
penguin-wwy commented 1 year ago

@pablogsal Thank you for the questions; I will try to clarify my thoughts.

> First: the formula in the paper you are providing is derived based on the rate of memory allocated but you are translating that to the number of objects. This assumes that every GC object contributes to the total memory at the same rate, which is not necessarily true.

Python's GC design is very different from the JVM or V8. Most objects are freed naturally by reference counting; the cyclic GC essentially exists to identify objects that cannot die naturally. Its work can be seen as two parts: cycle detection plus actually freeing the objects. The cost of freeing the objects cannot be compressed or eliminated, since it is the same work done in the refcount case. My idea is to reduce the per-collection cost of the cycle-detection part.
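A tiny concrete example of that split (illustrative only): an acyclic object dies via refcount the moment its last reference goes away and the cyclic collector never sees it, while a reference cycle lingers until a collection pass detects it.

```python
import gc

class Node:
    pass

a = Node()
del a            # refcount hits zero -> freed immediately, no GC involvement

b = Node()
b.ref = b        # self-referencing cycle keeps the refcount above zero
del b            # object is now unreachable but still alive
print(gc.collect())   # the cycle detector finds and frees it (prints >= 1)
```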

This is why many variables from the original formula are omitted: they offer no room for optimization in the cycle-detection part, the allocation rate for example. In the cyclic GC, the cost of traversing an object depends only on how many references it holds; an object with a large memory footprint and one with a small footprint cost the same to traverse.

> There are aspects of the model that do not apply completely to CPython: in CPython there are two mechanisms that affect ALL objects: reference count and gc. The paper assumes that the GC is the sole way objects are deallocated but in CPython the situation is more complicated because only cycles are the ones that get deallocated by the GC, but all objects in the GC and not in the GC also participate in reference counting.

Raising the generation 0 threshold allows more objects to be destroyed naturally, without needing a GC traversal.

As an example, if objects keep being allocated (e.g. new elements are appended to a collection in a loop), then a threshold of 700 with two collections and a threshold of 1400 with one collection traverse the same total number of objects. Ignoring the function-call overhead, the two can be assumed to cost the same.

But in reality, a larger threshold means more objects get released naturally before a collection runs. If the threshold is 1400, one GC traversal is triggered by the time a given program point is reached; with a threshold of 700, the number of traversals triggered by that same point will be two or more, which means more than 1400 objects end up going through the traversal. If the threshold is set to a very large value (e.g. INT_MAX), the effect is equivalent to gc.disable().

This is easy to confirm on cases such as async_tree by registering a GC callback that counts the number of collections. For example, on async_tree_io the number of GC calls drops from 5781 to 592.
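The exact measurement script isn't shown in the thread, but counting collections is a few lines with gc.callbacks; a sketch:

```python
import gc
from collections import Counter

gc_calls = Counter()

def _count(phase, info):
    if phase == "stop":
        gc_calls[info["generation"]] += 1

gc.callbacks.append(_count)

# ... run the workload (e.g. the async_tree_io benchmark) ...
# print(sum(gc_calls.values()), dict(gc_calls))
```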

> The other thing is that the formula that you are applying finally is sqrt(long_lived_pending) * 11. This to me is relatively unjustified. Why is long_lived_pending a good measure of L*g/c/s? Why 11? The paper goes into detail on how g, s and L should be measured in section 5.

By the analysis above, sqrt(L*g/c/s) reduces to sqrt(L/c) = sqrt(L) * sqrt(1/c); since c is a constant, this simplifies to sqrt(L) * C. L is then replaced with long_lived_pending.

> Why 11?

11 is the result of many tests: I tried values from 10 to 30 for this constant, and 11 gave the best results.

Finally, after discussing with the author, I think the core of this formula is sqrt(L) * C. When L < C**2 the threshold grows faster than the live set, which reduces how often the GC is triggered. When L > C**2 the threshold grows relatively more slowly (compared to a direct 1.x * limit), which keeps the cost of the next GC down because relatively fewer objects have to be traversed.
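A quick numeric check of that claim (illustrative arithmetic only), using C = 11 so that C**2 = 121:

```python
import math

C = 11
for L in (50, 121, 10_000, 1_000_000):
    threshold = int(math.sqrt(L)) * C
    print(L, threshold, f"{threshold / L:.2%}")
# 50      ->    77  (154% of L: the limit grows faster than the live set)
# 121     ->   121  (exactly C**2: threshold == L)
# 10000   ->  1100  (11% of L: far below a proportional "1.x * L" bump)
# 1000000 -> 11000  (1.1% of L)
```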

pitrou commented 1 year ago

I honestly don't understand what this is trying to achieve. I would point out several things:

DinoV commented 1 year ago

We've tuned this at Instagram, with a focus on reducing time spent in GC without increasing latency too much. The threshold values we came up with are 80000, 25, and 10.
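For reference, applying thresholds like those from user code is a one-liner; whether these particular values suit a given service is something that has to be measured, as discussed above.

```python
import gc

gc.set_threshold(80000, 25, 10)
print(gc.get_threshold())   # (80000, 25, 10)
```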