faster-cpython / ideas


Investigation: Why do some benchmarks have significantly increased memory usage with the new incremental GC? #668

Open mdboom opened 3 months ago

mdboom commented 3 months ago

Looking at the new incremental GC and its subsequent bugfix together, it's interesting that some benchmarks use significantly more memory. In particular, mypy2, pycparser and docutils.

Data:

![image](https://github.com/faster-cpython/ideas/assets/38294/19f35e20-fea8-428c-8ece-3556d7d17b6d)

| benchmark | base (bytes) | head (bytes) | diff (bytes) |
| --- | ---: | ---: | ---: |
| 2to3 | 23,279,323 | 23,574,625 | 295,302 |
| aiohttp | 30,047,865 | 29,992,862 | -55,003 |
| async_generators | 36,232,825 | 36,214,101 | -18,724 |
| async_tree_none | 80,446,220 | 80,228,937 | -217,283 |
| async_tree_cpu_io_mixed | 83,602,090 | 83,824,640 | 222,550 |
| async_tree_cpu_io_mixed_tg | 92,690,529 | 93,186,925 | 496,396 |
| async_tree_io | 136,706,145 | 135,888,700 | -817,445 |
| async_tree_io_tg | 145,445,644 | 146,071,747 | 626,103 |
| async_tree_memoization | 86,078,610 | 85,938,371 | -140,239 |
| async_tree_memoization_tg | 95,327,768 | 95,374,189 | 46,421 |
| async_tree_none_tg | 89,782,369 | 89,443,961 | -338,408 |
| asyncio_tcp | 50,583,454 | 51,326,390 | 742,936 |
| asyncio_tcp_ssl | 76,057,648 | 76,106,020 | 48,372 |
| asyncio_websockets | 65,070,811 | 64,648,728 | -422,083 |
| chameleon | 29,368,905 | 29,676,300 | 307,395 |
| chaos | 17,821,110 | 17,829,107 | 7,997 |
| comprehensions | 19,322,782 | 19,364,717 | 41,935 |
| bench_mp_pool | 20,654,762 | 20,830,890 | 176,128 |
| coroutines | 17,117,379 | 17,019,270 | -98,109 |
| crypto_pyaes | 17,455,981 | 17,514,691 | 58,710 |
| dask | 88,305,273 | 88,550,448 | 245,175 |
| deepcopy | 18,866,956 | 18,866,176 | -780 |
| deepcopy_reduce | 18,834,578 | 18,855,643 | 21,065 |
| deepcopy_memo | 18,891,727 | 18,836,918 | -54,809 |
| deltablue | 19,162,648 | 21,385,216 | 2,222,568 |
| django_template | 43,785,069 | 43,736,697 | -48,372 |
| djangocms | 30,294,016 | 30,260,467 | -33,549 |
| docutils | 64,943,835 | 85,793,840 | 20,850,005 |
| dulwich_log | 26,350,348 | 25,903,299 | -447,049 |
| fannkuch | 17,007,957 | 16,951,393 | -56,564 |
| float | 34,400,158 | 34,406,009 | 5,851 |
| create_gc_cycles | 17,081,685 | 17,093,973 | 12,288 |
| gc_traversal | 20,627,651 | 20,528,566 | -99,085 |
| generators | 29,539,571 | 29,482,227 | -57,344 |
| genshi_text | 20,681,094 | 21,016,576 | 335,482 |
| genshi_xml | 21,943,832 | 22,484,114 | 540,282 |
| go | 20,013,641 | 18,889,191 | -1,124,450 |
| gunicorn | 30,099,553 | 29,992,667 | -106,886 |
| hexiom | 18,173,561 | 18,145,475 | -28,086 |
| html5lib | 37,427,687 | 41,330,590 | 3,902,903 |
| json | 18,928,006 | 18,950,241 | 22,235 |
| json_dumps | 17,250,986 | 17,288,630 | 37,644 |
| json_loads | 17,275,172 | 17,305,600 | 30,428 |
| logging_format | 24,717,994 | 24,727,942 | 9,948 |
| logging_silent | 17,715,395 | 17,760,256 | 44,861 |
| logging_simple | 23,280,688 | 23,272,691 | -7,997 |
| mako | 26,486,881 | 26,608,396 | 121,515 |
| mdp | 25,583,225 | 25,669,827 | 86,602 |
| meteor_contest | 19,434,739 | 19,383,442 | -51,297 |
| mypy2 | 167,246,116 | 242,955,605 | 75,709,489 |
| nbody | 17,175,308 | 17,166,336 | -8,972 |
| nqueens | 17,238,308 | 17,207,491 | -30,817 |
| pathlib | 18,659,035 | 18,834,383 | 175,348 |
| pickle | 17,929,362 | 18,064,140 | 134,778 |
| pickle_dict | 17,877,869 | 17,895,228 | 17,359 |
| pickle_list | 17,998,799 | 17,963,885 | -34,914 |
| pickle_pure_python | 17,842,371 | 17,920,975 | 78,604 |
| pidigits | 17,161,264 | 17,136,883 | -24,381 |
| pprint_safe_repr | 37,872,591 | 37,859,718 | -12,873 |
| pprint_pformat | 37,840,798 | 37,846,454 | 5,656 |
| pycparser | 39,668,784 | 46,512,420 | 6,843,636 |
| pyflate | 60,085,199 | 59,582,366 | -502,833 |
| pylint | 92,368,310 | 95,753,752 | 3,385,442 |
| python_startup | 13,590,332 | 13,637,729 | 47,397 |
| python_startup_no_site | 13,584,871 | 13,620,175 | 35,304 |
| raytrace | 17,892,108 | 17,909,662 | 17,554 |
| regex_dna | 25,022,854 | 25,080,393 | 57,539 |
| regex_effbot | 17,865,971 | 17,899,520 | 33,549 |
| regex_v8 | 23,517,086 | 23,460,327 | -56,759 |
| richards | 17,694,720 | 17,702,912 | 8,192 |
| richards_super | 17,723,392 | 17,667,413 | -55,979 |
| scimark_fft | 18,076,428 | 18,055,168 | -21,260 |
| scimark_lu | 17,692,769 | 17,755,769 | 63,000 |
| scimark_monte_carlo | 18,058,873 | 18,048,926 | -9,947 |
| scimark_sparse_mat_mult | 18,079,353 | 18,216,667 | 137,314 |
| spectral_norm | 16,931,693 | 17,008,932 | 77,239 |
| sqlglot_normalize | 22,803,407 | 22,787,218 | -16,189 |
| sqlglot_optimize | 23,573,845 | 23,881,045 | 307,200 |
| sqlglot_parse | 22,958,665 | 25,704,350 | 2,745,685 |
| sqlglot_transpile | 23,045,461 | 24,079,993 | 1,034,532 |
| sqlite_synth | 19,792,457 | 19,782,509 | -9,948 |
| sympy_expand | 59,473,139 | 59,392,975 | -80,164 |
| sympy_integrate | 60,309,504 | 60,277,126 | -32,378 |
| sympy_sum | 72,773,632 | 72,791,771 | 18,139 |
| sympy_str | 60,266,203 | 60,242,602 | -23,601 |
| telco | 17,062,180 | 17,114,453 | 52,273 |
| thrift | 18,817,609 | 18,836,528 | 18,919 |
| tomli_loads | 180,438,162 | 177,827,254 | -2,610,908 |
| tornado_http | 41,430,649 | 41,497,356 | 66,707 |
| typing_runtime_protocols | 19,174,156 | 19,157,772 | -16,384 |
| unpack_sequence | 18,567,753 | 18,615,929 | 48,176 |
| unpickle | 18,123,239 | 18,082,474 | -40,765 |
| unpickle_list | 18,007,186 | 18,036,248 | 29,062 |
| unpickle_pure_python | 17,914,343 | 17,938,529 | 24,186 |
| xml_etree_parse | 22,025,362 | 22,069,443 | 44,081 |
| xml_etree_iterparse | 22,826,227 | 22,949,107 | 122,880 |
| xml_etree_generate | 23,017,569 | 23,610,514 | 592,945 |
| xml_etree_process | 22,720,121 | 22,722,462 | 2,341 |
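As an aside, here is a minimal sketch of how regressions like these can be picked out of the base/head columns; the 5% threshold and the handful of rows inlined below are my own choices for illustration, not part of the benchmarking tooling:

```python
# Flag benchmarks whose peak memory grew by more than a chosen threshold.
# A few (base, head) pairs in bytes, copied from the table above.
results = {
    "docutils": (64_943_835, 85_793_840),
    "mypy2": (167_246_116, 242_955_605),
    "pycparser": (39_668_784, 46_512_420),
    "tomli_loads": (180_438_162, 177_827_254),
}

THRESHOLD = 0.05  # flag anything that grew by more than 5%

for name, (base, head) in sorted(results.items()):
    growth = (head - base) / base
    if growth > THRESHOLD:
        print(f"{name}: +{growth:.1%} ({head - base:+,} bytes)")
```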

Looking at the mypy2 benchmark alone, I measured how memory usage increases with the size of the input data (basically, I feed it N copies of the input file rather than a single file).

With the legacy GC, memory usage scales roughly linearly with the size of the input. With the incremental GC, memory usage grows at a much faster rate, and appears to grow faster than linearly. The two lines are normalized, i.e. each point is (memory with N files / memory with 1 file).

*(image: normalized mypy2 memory usage vs. number of input files, incremental vs. legacy GC)*

| # files | incremental GC | legacy GC |
| ---: | ---: | ---: |
| 1 | 144 | 136 |
| 2 | 149 | 137 |
| 5 | 155 | 138 |
| 10 | 192 | 142 |
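For context, here is a minimal sketch of the kind of measurement harness described above; the input path, the mypy invocation, and the use of GNU time's peak-RSS report are my assumptions about the setup, not the actual benchmark code:

```python
# Sketch: run a workload over N concatenated copies of an input file and
# report its peak RSS, normalized to the 1-copy run.  Assumes Linux with
# GNU /usr/bin/time installed; INPUT and WORKLOAD are placeholders.
import subprocess
import tempfile

INPUT = "target.py"                  # hypothetical input file
WORKLOAD = ["python", "-m", "mypy"]  # hypothetical workload command


def peak_rss_mb(n_copies: int) -> float:
    with open(INPUT) as f:
        source = f.read()
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(source * n_copies)
        path = tmp.name
    # GNU time -v prints "Maximum resident set size (kbytes): N" to stderr.
    proc = subprocess.run(
        ["/usr/bin/time", "-v", *WORKLOAD, path],
        capture_output=True,
        text=True,
    )
    for line in proc.stderr.splitlines():
        if "Maximum resident set size" in line:
            return int(line.rsplit(":", 1)[1]) / 1024
    raise RuntimeError("peak RSS not reported")


sizes = [1, 2, 5, 10]
peaks = {n: peak_rss_mb(n) for n in sizes}
for n in sizes:
    print(f"{n} file(s): {peaks[n] / peaks[1]:.2f}x")
```

Running the same script under the two CPython builds (with and without incremental GC) and comparing the normalized ratios is the general shape of the comparison above.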
mdboom commented 3 months ago

I ran this again now that some follow-on PRs have been merged, and with more data points. Here I am comparing a recent commit, python/cpython@c1712ef, to the commit prior to the incremental GC, python/cpython@d5ebf8b71fd18d7a1f2f6b670a2c18749dc2b55e.

Now, there is almost no difference at smaller data sizes. As the data size increases, however, the legacy GC's memory usage grows at a much smaller rate than the incremental GC's. In fact, I think the "stair-stepped" appearance of the legacy GC line probably supports @markshannon's suggestion that the legacy GC did "too much" work. More importantly, though, the incremental GC's memory usage does not appear to grow at a faster-than-linear rate as the input size increases, suggesting there isn't an issue with cycles that are never found ("leaked") or anything of that sort. So this does seem like a reasonable tradeoff to make, rather than a sign of a bug causing runaway memory usage.

*(image: peak memory usage vs. data size, incremental vs. legacy GC)*

| data size | incremental GC | legacy GC |
| ---: | ---: | ---: |
| 1 | 109 MB | 109 MB |
| 2 | 118 MB | 118 MB |
| 4 | 126 MB | 125 MB |
| 8 | 187 MB | 139 MB |
| 16 | 244 MB | 193 MB |
| 24 | 335 MB | 267 MB |
| 32 | 435 MB | 270 MB |
| 48 | 619 MB | 451 MB |
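As a rough cross-check of the linearity observation, this is my own arithmetic on the numbers above (not part of the benchmark harness):

```python
# Peak memory (MB) vs. data size, copied from the table above.
sizes = [1, 2, 4, 8, 16, 24, 32, 48]
incremental = [109, 118, 126, 187, 244, 335, 435, 619]
legacy = [109, 118, 125, 139, 193, 267, 270, 451]

# MB of additional memory per unit of data size between consecutive points.
# Roughly constant increments are consistent with linear (not super-linear)
# growth; big swings between increments show up as "stair steps" in the plot.
for name, mem in (("incremental", incremental), ("legacy", legacy)):
    slopes = [
        (m1 - m0) / (s1 - s0)
        for (s0, m0), (s1, m1) in zip(zip(sizes, mem), zip(sizes[1:], mem[1:]))
    ]
    print(name, [round(s, 1) for s in slopes])
```

The incremental-GC increments settle at roughly 11-12 MB per unit of data size at the larger sizes, while the legacy-GC increments swing between roughly 0 and 11, which matches the stair-stepped appearance noted above.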