mdboom opened this issue 1 year ago
(C) A third option is to add support for versioning benchmarks explicitly in pyperformance, and update pyperf to only compare two benchmarks of the same version. That feels like a lot of upstream work when (A) is probably "good enough".
I personally think that this makes the most sense, even if it's also the most work.
This is a general problem, right? Anytime we update the benchmark dependencies (for example, to support a bleeding-edge Python), past results should be invalidated.
We could just use the actual pyperformance version for this invalidation, but I think our ability to run external benchmark suites (like the one where mypy lives) makes this a bit trickier. So per-benchmark versioning seems best to me.
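To make the idea concrete, here is a rough sketch of how a comparison step could skip benchmarks whose versions differ. This is not existing pyperformance/pyperf behavior: the `benchmark_version` metadata key is hypothetical and would have to be recorded by pyperformance first; only the loading/metadata calls are real pyperf API.

```python
# Sketch only: "benchmark_version" is a hypothetical metadata key that
# pyperformance would need to start recording; pyperf does not do this today.
import pyperf

def comparable_benchmarks(ref_path, head_path):
    """Yield (ref, head) benchmark pairs whose names and versions both match."""
    ref_suite = pyperf.BenchmarkSuite.load(ref_path)
    head_suite = pyperf.BenchmarkSuite.load(head_path)
    head_by_name = {b.get_name(): b for b in head_suite.get_benchmarks()}

    for ref in ref_suite.get_benchmarks():
        head = head_by_name.get(ref.get_name())
        if head is None:
            continue
        ref_ver = ref.get_metadata().get("benchmark_version")
        head_ver = head.get_metadata().get("benchmark_version")
        if ref_ver != head_ver:
            # Versions differ (or are missing): skip rather than compare
            # incompatible runs, e.g. the old vs. the fixed mypy benchmark.
            continue
        yield ref, head
```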
I also remember seeing an issue where the last run of each mypy benchmarking process took twice as long. It appears to still be present in benchmarking results (taken from here):
That seems like a different issue than the one identified here.
We could just use the actual pyperformance version for this invalidation, but I think our ability to run external benchmark suites (like the one where mypy lives) makes this a bit trickier. So per-benchmark versioning seems best to me.
The infra currently hashes the version of pyperformance and the pyston benchmarks and adds an asterisk next to the results if they don't match, so at least there's some indication of incompatibility. I was worried that making that a hard requirement would invalidate things way too often. So, yeah, the way out of that probably is to version individual benchmarks and have a hard requirement on those matching.
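For illustration, the mismatch flag described above could amount to comparing a hash of the dependency versions recorded with each run; the field layout and function names below are made up, not the actual format used by the benchmarking infra.

```python
# Illustrative only: the version tuple and result formatting are hypothetical.
import hashlib

def version_hash(pyperformance_version, pyston_benchmarks_rev):
    data = f"{pyperformance_version}:{pyston_benchmarks_rev}".encode()
    return hashlib.sha256(data).hexdigest()[:12]

def format_result(value, base_versions, head_versions):
    """Append an asterisk when the two runs used different benchmark versions."""
    mismatch = version_hash(*base_versions) != version_hash(*head_versions)
    return f"{value}*" if mismatch else value

# Example: the head run used a newer benchmark checkout than the base run.
print(format_result("1.25x faster", ("1.0.4", "abc123"), ("1.0.4", "def456")))
```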
That seems like a different issue than the one identified here.
Yeah, could be. Will endeavor to fix both things before committing.
I can't say I really understand it, but the fix of ignoring the first two iterations seems to address the issue of the last result taking extra time (and additionally, the warning about the stddev being too high goes away).
Just to be clear, the runs being ignored are in separate processes? The idea is to stabilize the environment, not to allow the VM to "warm up"?
(C) may be the more elegant solution, but I say go for (A). I don't think this merits the extra effort.
Just to be clear, the runs being ignored are in separate processes?
No, the same process. It runs mypy over a large source file 20 times in a loop. The first time it's caching the file contents in memory, and the idea is to remove the effect of disk I/O. I'm baffled as to why I need to ignore the first two iterations to get stable results, but one isn't enough.
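As a sketch of the kind of fix being discussed: the inner loop simply discards its first couple of iterations before recording timings. The loop structure, names, and warm-up count here are illustrative, not the exact pyston-macrobenchmarks code; `run_mypy_once` stands in for whatever the real benchmark does per iteration.

```python
# Illustrative sketch of skipping warm-up iterations within a single process.
import time

WARMUP_ITERATIONS = 2     # first pass populates mypy's in-memory file cache
MEASURED_ITERATIONS = 20

def bench_mypy(run_mypy_once):
    times = []
    for i in range(WARMUP_ITERATIONS + MEASURED_ITERATIONS):
        start = time.perf_counter()
        run_mypy_once()
        elapsed = time.perf_counter() - start
        if i >= WARMUP_ITERATIONS:
            # Only record iterations taken after the cache is warm, so disk
            # I/O on the first pass does not inflate the variance.
            times.append(elapsed)
    return times
```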
The motivating thing here is that the mypy benchmark in pyston-macrobenchmarks is arguably buggy. It has far greater variation in the measurements than other benchmarks. The reason for this is that it runs mypy over some files 20 times -- the first time it actually loads the files from disk, and subsequent times they are read from an in-memory cache in mypy itself. (To be clear, I'm not talking about the effects of any OS-level filesystem caching that may be going on.) The fix is easy -- ignore the first run through the loop. The problem is that if we fix this, we will artificially "improve" our results -- the fixed version of the benchmark is around 2x faster than the old one.

(A) I think the safest thing to do is to rename the benchmark (to mypy2) at the same time this change is made. When comparisons are made between results, one with mypy and one with mypy2, they will simply be ignored. We can then re-run our important bases (3.10.4 and 3.11.0), and going forward we will have good comparisons for the new mypy benchmark. (Note this is a rename, not a copy -- no need to keep around the old and known-buggy benchmark.)

(B) An alternative is to not rename the benchmark. In that case, we would want to delete mypy results from all existing results (easy enough to do with a script over the benchmarking repo), and then again re-run the bases, and re-generate all of the comparisons. Going forward, we'd have good results, but it would change all of the existing results as well, since they would no longer include the mypy benchmark. Arguably more accurate, but changing old results seems fishy.

(C) A third option is to add support for versioning benchmarks explicitly in pyperformance, and update pyperf to only compare two benchmarks of the same version. That feels like a lot of upstream work when (A) is probably "good enough".
Thoughts?