mdboom opened this issue 1 year ago
(C) A third option is to add support for versioning benchmarks explicitly in pyperformance, and update pyperf to only compare two benchmarks of the same version. That feels like a lot of upstream work when (A) is probably "good enough".
I personally think that this makes the most sense, even if it's also the most work.
This is a general problem, right? Anytime we update the benchmark dependencies (for example, to support a bleeding-edge Python), past results should be invalidated.
We could just use the actual pyperformance version for this invalidation, but I think our ability to run external benchmark suites (like the one where mypy lives) makes this a bit trickier. So per-benchmark versioning seems best to me.
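To make the idea concrete, here is a rough sketch of how a comparison step could skip benchmarks whose versions differ. This is not existing pyperformance/pyperf behavior: the `benchmark_version` metadata key is hypothetical and would have to be recorded by pyperformance first; only the loading/metadata calls are real pyperf API.

```python
# Sketch only: "benchmark_version" is a hypothetical metadata key that
# pyperformance would need to start recording; pyperf does not do this today.
import pyperf

def comparable_benchmarks(ref_path, head_path):
    """Yield (ref, head) benchmark pairs whose names and versions both match."""
    ref_suite = pyperf.BenchmarkSuite.load(ref_path)
    head_suite = pyperf.BenchmarkSuite.load(head_path)
    head_by_name = {b.get_name(): b for b in head_suite.get_benchmarks()}

    for ref in ref_suite.get_benchmarks():
        head = head_by_name.get(ref.get_name())
        if head is None:
            continue
        ref_ver = ref.get_metadata().get("benchmark_version")
        head_ver = head.get_metadata().get("benchmark_version")
        if ref_ver != head_ver:
            # Versions differ (or are missing): skip rather than compare
            # incompatible runs, e.g. the old vs. the fixed mypy benchmark.
            continue
        yield ref, head
```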
I also remember seeing an issue where the last run of each mypy benchmarking process took twice as long. It appears to still be present in benchmarking results (taken from here):
That seems like a different issue than the one identified here.
We could just use the actual pyperformance version for this invalidation, but I think our ability to run external benchmark suites (like the one where mypy lives) makes this a bit trickier. So per-benchmark versioning seems best to me.
The infra currently hashes the version of pyperformance and the pyston benchmarks and adds an asterisk next to the results if they don't match, so at least there's some indication of incompatibility. I was worried that making that a hard requirement would invalidate things way too often. So, yeah, the way out of that probably is to version individual benchmarks and have a hard requirement on those matching.
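For illustration, the mismatch flag described above could amount to comparing a hash of the dependency versions recorded with each run; the field layout and function names below are made up, not the actual format used by the benchmarking infra.

```python
# Illustrative only: the version tuple and result formatting are hypothetical.
import hashlib

def version_hash(pyperformance_version, pyston_benchmarks_rev):
    data = f"{pyperformance_version}:{pyston_benchmarks_rev}".encode()
    return hashlib.sha256(data).hexdigest()[:12]

def format_result(value, base_versions, head_versions):
    """Append an asterisk when the two runs used different benchmark versions."""
    mismatch = version_hash(*base_versions) != version_hash(*head_versions)
    return f"{value}*" if mismatch else value

# Example: the head run used a newer benchmark checkout than the base run.
print(format_result("1.25x faster", ("1.0.4", "abc123"), ("1.0.4", "def456")))
```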
That seems like a different issue than the one identified here.
Yeah, could be. Will endeavor to fix both things before committing.
I can't say I really understand it, but the fix of ignoring the first two iterations seems to address the issue of the last result taking extra time (and additionally, the warning about the stddev being too high goes away).
Just to be clear, the runs being ignored are in separate processes? The idea is to stabilize the environment, not to allow the VM to "warm up"?
(C) may be the more elegant solution, but I say go for (A). I don't think this merits the extra effort.
Just to be clear, the runs being ignored are in separate processes?
No, the same process. It runs mypy over a large source file 20 times in a loop. The first time it's caching the file contents in memory, and the idea is to remove the effect of disk I/O. I'm baffled as to why I need to ignore the first two iterations to get stable results, but one isn't enough.
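As a sketch of the kind of fix being discussed: the inner loop simply discards its first couple of iterations before recording timings. The loop structure, names, and warm-up count here are illustrative, not the exact pyston-macrobenchmarks code; `run_mypy_once` stands in for whatever the real benchmark does per iteration.

```python
# Illustrative sketch of skipping warm-up iterations within a single process.
import time

WARMUP_ITERATIONS = 2     # first pass populates mypy's in-memory file cache
MEASURED_ITERATIONS = 20

def bench_mypy(run_mypy_once):
    times = []
    for i in range(WARMUP_ITERATIONS + MEASURED_ITERATIONS):
        start = time.perf_counter()
        run_mypy_once()
        elapsed = time.perf_counter() - start
        if i >= WARMUP_ITERATIONS:
            # Only record iterations taken after the cache is warm, so disk
            # I/O on the first pass does not inflate the variance.
            times.append(elapsed)
    return times
```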
The motivating thing here is that the mypy benchmark in pyston-macrobenchmarks is arguably buggy. It has far greater variation in the measurements than other benchmarks. The reason for this is that it runs mypy over some files 20 times -- the first time it actually loads the files from disk, and subsequent times they are read from an in-memory cache in mypy itself. (To be clear, I'm not talking about the effects of any OS-level filesystem caching that may be going on.) The fix is easy -- ignore the first run through the loop. The problem is that if we fix this, we will artificially "improve" our results -- the fixed version of the benchmark is around 2x faster than the old one.

(A) I think the safest thing to do is to rename the benchmark (to mypy2) at the same time this change is made. When comparisons are made between results, one with mypy and one with mypy2, they will simply be ignored. We can then re-run our important bases (3.10.4 and 3.11.0), and going forward we will have good comparisons for the new mypy benchmark. (Note this is a rename, not a copy -- no need to keep around the old and known-buggy benchmark.)

(B) An alternative is to not rename the benchmark. In that case, we would want to delete mypy results from all existing results (easy enough to do with a script over the benchmarking repo), and then again re-run the bases, and re-generate all of the comparisons. Going forward, we'd have good results, but it would change all of the existing results as well, since they would no longer include the mypy benchmark. Arguably more accurate, but changing old results seems fishy.

(C) A third option is to add support for versioning benchmarks explicitly in pyperformance, and update pyperf to only compare two benchmarks of the same version. That feels like a lot of upstream work when (A) is probably "good enough".
Thoughts?