faster-cpython / ideas


Fewer collections by automatically adjusting the GC threshold #523

Open penguin-wwy opened 1 year ago

penguin-wwy commented 1 year ago

Initially, the idea came from Optimal heap limits for reducing browser memory use.

If cyclic garbage essentially never occurs at execution time, reference counting alone can free all the memory. In that case we want the threshold to increase, so that fewer GC runs happen.

Conversely, if most deallocations are done by the GC, the threshold should be lowered to reduce the number of objects the GC has to traverse on each pass.

After discussing it with the authors of the paper, we modified the paper's formula to fit Python's cyclic garbage collection:

M = L + sqrt(L * C)
M: Heap limit
L: Live memory
C: Constant

The variable M can be read as the total number of GC-tracked objects, and L as the number of objects that survived the last collection. The expression M - L then represents the number of objects allowed to accumulate between collections, i.e. the generation 0 threshold.

gcstate->generations[0].threshold = sqrt(L) * C

Finally, to reduce how often the threshold is adjusted, we use long_lived_pending instead of the exact L. In other words, each completed generations[1] collection is treated as a phase at which the threshold is re-tuned.

The final formula is gcstate->generations[0].threshold = sqrt(gcstate->long_lived_pending) * C. The constant C needs to be tuned empirically; in my experiments 11 worked best.
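For illustration, here is a rough user-level sketch of that rule (the actual change lives in CPython's gcmodule.c, see the branch linked below). It is not the patch itself: the interpreter's internal long_lived_pending counter is approximated here by the size of the oldest generation, and the floor of 700 (the default threshold) is my own assumption.

```python
import gc
import math

C = 11  # the constant arrived at experimentally above

def _retune(phase, info):
    # Treat each completed generation-1 collection as a tuning phase,
    # mirroring the long_lived_pending-based adjustment described above.
    if phase != "stop" or info["generation"] != 1:
        return
    # Crude stand-in for L: how many tracked objects live in the oldest
    # generation (the real patch reads gcstate->long_lived_pending instead).
    survivors = len(gc.get_objects(generation=2))
    new_gen0 = max(700, int(math.sqrt(survivors)) * C)  # sqrt(L) * C
    _, gen1, gen2 = gc.get_threshold()
    gc.set_threshold(new_gen0, gen1, gen2)

gc.callbacks.append(_retune)
```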

# pyperformance==1.0.6
python3 -m pyperf compare_to --table --table-format md --min-speed 2 3.12.json gc.json
| Benchmark | 3.12 | gc |
|-----------|------|----|
| async_tree_io | 1.43 sec | 780 ms: 1.84x faster |
| async_tree_memoization | 779 ms | 480 ms: 1.62x faster |
| async_tree_none | 641 ms | 411 ms: 1.56x faster |
| async_tree_cpu_io_mixed | 908 ms | 649 ms: 1.40x faster |
| float | 82.0 ms | 77.0 ms: 1.07x faster |
| xml_etree_generate | 89.6 ms | 85.1 ms: 1.05x faster |
| spectral_norm | 115 ms | 111 ms: 1.04x faster |
| html5lib | 74.9 ms | 72.3 ms: 1.04x faster |
| docutils | 2.66 sec | 2.57 sec: 1.04x faster |
| pickle | 13.1 us | 12.7 us: 1.03x faster |
| xml_etree_process | 61.2 ms | 59.6 ms: 1.03x faster |
| unpickle | 15.2 us | 14.8 us: 1.03x faster |
| deepcopy_reduce | 3.33 us | 3.24 us: 1.03x faster |
| regex_dna | 203 ms | 199 ms: 1.02x faster |
| json_dumps | 9.87 ms | 9.65 ms: 1.02x faster |
| chaos | 81.7 ms | 83.4 ms: 1.02x slower |
| fannkuch | 436 ms | 446 ms: 1.02x slower |
| raytrace | 328 ms | 336 ms: 1.02x slower |
| coroutines | 31.1 ms | 32.0 ms: 1.03x slower |
| richards | 48.8 ms | 50.3 ms: 1.03x slower |
| chameleon | 7.40 ms | 7.70 ms: 1.04x slower |
| scimark_fft | 365 ms | 381 ms: 1.04x slower |
| mdp | 3.17 sec | 3.33 sec: 1.05x slower |
| crypto_pyaes | 82.8 ms | 88.0 ms: 1.06x slower |
| json_loads | 27.0 us | 28.7 us: 1.06x slower |
| mako | 9.74 ms | 10.5 ms: 1.08x slower |
| deepcopy_memo | 34.6 us | 38.5 us: 1.11x slower |
| logging_silent | 93.0 ns | 111 ns: 1.19x slower |
| Geometric mean | (ref) | 1.02x faster |

In pyperformance, the async benchmarks are the most rewarding case, which is to be expected, since asynchronous code creates a lot of generator objects.

Also, benchmarks with the 'serialize' tag get a general boost, which may be related to the fact that they allocate more objects:

| Benchmark | 3.12 | gc |
|-----------|------|----|
| xml_etree_generate | 89.6 ms | 85.1 ms: 1.05x faster |
| pickle | 13.1 us | 12.7 us: 1.03x faster |
| xml_etree_process | 61.2 ms | 59.6 ms: 1.03x faster |
| unpickle | 15.2 us | 14.8 us: 1.03x faster |
| json_dumps | 9.87 ms | 9.65 ms: 1.02x faster |
| json_loads | 27.0 us | 28.7 us: 1.06x slower |
| Geometric mean | (ref) | 1.01x faster |

Most of the other changes are under 2%, which I think is an acceptable trade-off.

https://github.com/python/cpython/compare/main...penguin-wwy:cpython:main_gc_test2

corona10 commented 1 year ago

cc @pablogsal

pablogsal commented 1 year ago

Hi @penguin-wwy. Limiting GC collections is something I am currently working on, so thanks for mentioning your approach here. It is similar in spirit to what we currently do to avoid quadratic behaviour in the last generation.

This approach is also similar to one of the things I am benchmarking at the moment: dynamic thresholds. The idea is the same, but what drives the generation thresholds is different: I am currently trying other signals, such as the ratio of new objects to dead objects, the rate of GC calls over time, and a different formula relating new objects to current objects (similar to the one used for the old generation).

In general, I think this is something that should be applied to all generations, not only the first one.

The other problem I am trying to fight is avoiding a ton of collections when a huge number of objects is created in a short span of time. That is where checking the GC rate over time can prove useful, because the rest of the metrics I can think of fail in one case or the other.
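As a minimal user-level illustration of the "rate in time" idea (not the actual experiment being benchmarked here; the 50 ms interval and the explicit maybe_collect() hook are assumptions for the sketch): take over scheduling of the young collection and refuse to collect more often than some minimum interval, however fast objects are being created.

```python
import gc
import time

MIN_INTERVAL = 0.05   # assumed: at most one young collection every 50 ms
PENDING_LIMIT = 700   # mirrors the default generation-0 threshold

_last_collect = time.monotonic()
gc.disable()          # take over scheduling of collections ourselves

def maybe_collect():
    """Call from convenient points in the application (e.g. an event loop tick)."""
    global _last_collect
    now = time.monotonic()
    if gc.get_count()[0] >= PENDING_LIMIT and now - _last_collect >= MIN_INTERVAL:
        gc.collect(0)
        _last_collect = now
```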

Finally, the last approach I am benchmarking is having no generations at all. It has been suggested that Python doesn't follow the weak generational hypothesis because most objects die via reference counting, so in theory a single generation with the current heuristic should be more performant.

Of course measuring any of this is very difficult because the GC is very idiosyncratic and depends heavily on the program you are currently executing.

pablogsal commented 1 year ago

Regarding your formula, there are several things I am not sure are being applied correctly.

But the formula you are using doesn't measure any of these quantities. You are using the number of objects that come into the GC, but that doesn't measure the ratio of objects cleaned by refcount to objects cleaned by the GC; it measures how many objects survived a GC pass. It says nothing about how many of those surviving objects will later be cleaned by refcount or by the GC, and it also doesn't take into account that there are tons of objects that are not tracked by the GC at all.
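As a rough illustration of what such a measurement could look like from user code (not something the proposal does): a callback that compares, per collection, how many objects accumulated in the young generation against how many the cyclic GC actually reclaimed. Note that it only ever sees GC-tracked objects, which is exactly the blind spot mentioned above.

```python
import gc

_entered = 0

def _ratio_probe(phase, info):
    global _entered
    if phase == "start":
        # Objects that accumulated in generation 0 since the last collection.
        _entered = gc.get_count()[0]
    elif phase == "stop" and _entered:
        freed = info["collected"]
        print(f"gen {info['generation']}: GC freed {freed} of ~{_entered} "
              f"newly tracked objects ({100 * freed / _entered:.1f}%)")

gc.callbacks.append(_ratio_probe)
```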

The paper goes into detail on how g, s and L should be measured in section 5:

> Our proposed approach, which we call MemBalancer, therefore measures 𝐿 and 𝑠 on the main thread after every major garbage collection, recording 𝐿 and 𝑠 and recomputing the heap limit 𝑀. Major garbage collections also measure and update 𝑔 to avoid measuring 𝑔 across major GC boundaries. As a result, heap limits are updated at least once per second, or more often if garbage collections are frequent.

They also say:

> MemBalancer therefore smooths the new allocation data using an exponentially weighted moving average, which estimates a value as a weighted linear combination of the current value and the last estimate. For 𝑔, a smoothing constant of 𝛼𝑔 = 0.95 is used, so that the half-life of any data point is about thirteen seconds, which reduces jitter enough to prevent spurious garbage collections but still allows the rate 𝑔 to respond rapidly to changes in program behavior. Similarly, the measured 𝑠 is smoothed, with a smoothing constant of 𝛼𝑠 = 0.5, to account for anomalously long garbage collections. Both 𝑔 and 𝑠 are measured by dividing a number of bytes by a duration; as is standard, we smooth the top and bottom individually.
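The smoothing described there is just an exponentially weighted moving average with the numerator (bytes) and denominator (seconds) smoothed separately; a minimal sketch using the constants quoted above:

```python
class RateEWMA:
    """Smooth a rate (bytes / seconds), smoothing top and bottom individually."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.num = None   # smoothed byte count
        self.den = None   # smoothed duration

    def update(self, nbytes, seconds):
        if self.num is None:
            self.num, self.den = nbytes, seconds
        else:
            self.num = self.alpha * self.num + (1 - self.alpha) * nbytes
            self.den = self.alpha * self.den + (1 - self.alpha) * seconds
        return self.num / self.den   # the smoothed rate

g_rate = RateEWMA(alpha=0.95)  # allocation rate g
s_rate = RateEWMA(alpha=0.5)   # collection speed s
```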

But somehow all of that is ignored and the final formula is sqrt(long_lived_pending) * 11?

New gc threshold: 891
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 2.359882
Sucess ratio: 0.728450
Sucess ratio: 7.075471
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 2.097902
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 1265
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 22.963537
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 1771
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 2409
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 3069
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 3740
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 4422
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 5115
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 5819
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 6523
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 7238
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 3289
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 3960
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
New gc threshold: 4653
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 0.000000
Sucess ratio: 6.827382
Sucess ratio: 49.068794
Sucess ratio: 32.848980
penguin-wwy commented 1 year ago

@pablogsal Thank you for the questions; I will try to clarify my thoughts.

> First: the formula in the paper you are providing is derived based on the rate of memory allocated but you are translating that to the number of objects. This assumes that every GC object contributes to the total memory at the same rate, which is not necessarily true.

Python's GC design is very different from the JVM or V8. Most objects are freed naturally by reference counting; the cyclic GC essentially exists to identify objects that cannot die naturally. Its work can be seen as two parts: cycle detection plus actually freeing the objects. The cost of freeing the objects cannot be compressed or eliminated, since it is the same work done in the refcount case. My idea is to reduce the per-collection cost of the cycle-detection part.
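A tiny concrete example of that split (illustrative only): an acyclic object dies via refcount the moment its last reference goes away and the cyclic collector never sees it, while a reference cycle lingers until a collection pass detects it.

```python
import gc

class Node:
    pass

a = Node()
del a            # refcount hits zero -> freed immediately, no GC involvement

b = Node()
b.ref = b        # self-referencing cycle keeps the refcount above zero
del b            # object is now unreachable but still alive
print(gc.collect())   # the cycle detector finds and frees it (prints >= 1)
```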

This is why many variables from the original formula are omitted: they offer no room for optimization in the cycle-detection part, the allocation rate for example. In the cyclic GC, the cost of traversing an object depends only on how many references it holds; an object with a large memory footprint and one with a small footprint cost the same to traverse.

> There are aspects of the model that do not apply completely to CPython: in CPython there are two mechanisms that affect ALL objects: reference count and gc. The paper assumes that the GC is the sole way objects are deallocated but in CPython the situation is more complicated because only cycles are the ones that get deallocated by the GC, but all objects in the GC and not in the GC also participate in reference counting.

Raising the generation 0 threshold allows more objects to be destroyed naturally, without needing a GC traversal.

As an example, if objects keep being allocated (e.g. new elements are appended to a collection in a loop), then a threshold of 700 with two collections and a threshold of 1400 with one collection traverse the same total number of objects. Ignoring the function-call overhead, the two can be assumed to cost the same.

But in reality, a larger threshold means more objects get released naturally before a collection runs. If the threshold is 1400, one GC traversal is triggered by the time a given program point is reached; with a threshold of 700, the number of traversals triggered by that same point will be two or more, which means more than 1400 objects end up going through the traversal. If the threshold is set to a very large value (e.g. INT_MAX), the effect is equivalent to gc.disable().

This is easy to confirm on cases such as async_tree by registering a GC callback that counts the number of collections. For example, on async_tree_io the number of GC calls drops from 5781 to 592.
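The exact measurement script isn't shown in the thread, but counting collections is a few lines with gc.callbacks; a sketch:

```python
import gc
from collections import Counter

gc_calls = Counter()

def _count(phase, info):
    if phase == "stop":
        gc_calls[info["generation"]] += 1

gc.callbacks.append(_count)

# ... run the workload (e.g. the async_tree_io benchmark) ...
# print(sum(gc_calls.values()), dict(gc_calls))
```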

> The other thing is that the formula that you are applying finally is sqrt(long_lived_pending) * 11. This to me is relatively unjustified. Why is long_lived_pending a good measure of L*g/c/s? Why 11? The paper goes into detail on how g, s and L should be measured in section 5.

By the analysis above, sqrt(L*g/c/s) reduces to sqrt(L/c) = sqrt(L) * sqrt(1/c); since c is a constant, this simplifies to sqrt(L) * C. L is then replaced with long_lived_pending.

> Why 11?

11 is the result of many tests: I tried values from 10 to 30 for this constant, and 11 gave the best results.

Finally, after discussing with the author, I think the core of this formula is sqrt(L) * C. When L < C**2 the threshold grows faster than the live set, which reduces how often the GC is triggered. When L > C**2 the threshold grows relatively more slowly (compared to a direct 1.x * limit), which keeps the cost of the next GC down because relatively fewer objects have to be traversed.
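A quick numeric check of that claim (illustrative arithmetic only), using C = 11 so that C**2 = 121:

```python
import math

C = 11
for L in (50, 121, 10_000, 1_000_000):
    threshold = int(math.sqrt(L)) * C
    print(L, threshold, f"{threshold / L:.2%}")
# 50      ->    77  (154% of L: the limit grows faster than the live set)
# 121     ->   121  (exactly C**2: threshold == L)
# 10000   ->  1100  (11% of L: far below a proportional "1.x * L" bump)
# 1000000 -> 11000  (1.1% of L)
```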

pitrou commented 1 year ago

I honestly don't understand what this is trying to achieve. I would point out several things:

DinoV commented 1 year ago

We've tuned this at Instagram, with a focus on reducing time spent in GC without increasing latency too much. The threshold values we came up with are 80000, 25, and 10.
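For reference, applying thresholds like those from user code is a one-liner; whether these particular values suit a given service is something that has to be measured, as discussed above.

```python
import gc

gc.set_threshold(80000, 25, 10)
print(gc.get_threshold())   # (80000, 25, 10)
```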