Investigation: Why is the performance of the mypy2 bench so poor on main relative to 3.12.0? #646

faster-cpython / ideas

Closed mdboom closed 5 months ago

mdboom commented 5 months ago

Currently, the mypy2 benchmark is a real outlier at 2.5x slower on main than on 3.12.0. What happened to make it so much worse? Is it merely turning on Tier 2, or something else?

Plotting the data we have for this benchmark over time, it looks like it is not Tier 2 related, but happened at a fixed moment in time (though there is one weird outlier, which may have more to do with git merge history than anything else):

[Plot: mypy2 benchmark results over time]

The massive slowdown happened sometime between CPython commit 3faf8e5 and 05a370a, which is a range of 76 commits. I'm going to bisect this to see if I can find the culprit.
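For reference, bisecting a 76-commit range takes at most 7 measurements. The following is a minimal sketch of the search that a tool like `git bisect` performs, not the actual procedure used here. The commit labels and the `is_bad` predicate are hypothetical stand-ins for "check out the commit, run the benchmark, compare against a threshold". It also assumes the slowdown is monotonic (once a commit is slow, every later one is too), which the outlier mentioned above could violate.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the last commit is bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


# Hypothetical 76-commit range; "c41" is an arbitrary pretend culprit.
commits = [f"c{i:02d}" for i in range(76)]
calls = []


def is_bad(commit: str) -> bool:
    calls.append(commit)      # record how many measurements were needed
    return commit >= "c41"


culprit = first_bad(commits, is_bad)
print(culprit, len(calls))
```

Each call to `is_bad` corresponds to one full benchmark run, so the number of calls is what determines how long the bisection takes.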

Cc: @markshannon

brandtbucher commented 5 months ago

In the past, we've had an issue where sometimes the mypyc-compiled C version of mypy was installed instead of the pure-Python version (which we want). I believe we got the pure-Python version by passing --no-binary or something similar to the pip install command.

This almost looks like either:

3.12 used to install the Python version, and started installing the C version.
3.13 used to install the C version, and started installing the Python version.

Not 100% sure this is the issue, but given the history of this benchmark it's the first place my mind went.

mdboom commented 5 months ago

Yeah - my mind went there, too, but I think I've ruled that out (we're getting Python versions everywhere, as far as I can tell).
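One way to check this kind of thing is to see whether a module resolves to a `.py` source file or to a compiled extension module. The following is a rough sketch of that idea, not necessarily the check that was actually run here. The helper names are made up for illustration, and which mypy submodule would be the right one to inspect is an assumption left to the reader.

```python
import importlib.machinery
import importlib.util


def is_compiled_path(path: str) -> bool:
    """True if the path ends with one of this interpreter's extension-module suffixes."""
    return any(path.endswith(suffix) for suffix in importlib.machinery.EXTENSION_SUFFIXES)


def module_is_compiled(name: str) -> bool:
    """True if the named module would be loaded from a compiled extension file.

    Note: for a dotted name, find_spec() imports the parent package.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        raise ModuleNotFoundError(name)
    # origin is a file path for file-based modules, or a marker such as
    # "built-in" or "frozen" (or None) otherwise.
    return spec.origin is not None and is_compiled_path(spec.origin)


# The stdlib json package is pure Python, so this reports False.
print(module_is_compiled("json"))
```

Running the same check inside each benchmarking environment, against a module from the installed mypy package, would show which flavor that environment actually picked up.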

mdboom commented 5 months ago

It's a similar problem to what @brandtbucher suggested -- it's the benchmark changing, not CPython, just not the C vs. Python problem. That date is the moment that this change to the benchmark was deployed to our benchmarking infrastructure. In hindsight, we probably should have renamed the benchmark given that it changes the results so dramatically -- (No blame -- I reviewed that PR, IIRC).

I think the thing to do is:

1) Remove this benchmark entirely from our dataset -- this is likely to have the effect of slightly improving the results on recent commits
2) Backfill the bases with the new version of the benchmark
3) Then going forward we should have reliable results for this benchmark
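To illustrate why removing the benchmark should nudge the overall numbers up: assuming the headline figure is a geometric mean of per-benchmark speedups (an assumption, not something stated in this thread), a single benchmark at 2.5x slower drags the aggregate down noticeably. The benchmark names and values below are hypothetical, apart from 0.40, which corresponds to the 2.5x slowdown.

```python
from statistics import geometric_mean

# Hypothetical speedups of main relative to the base (values > 1 mean faster).
# 0.40 corresponds to "2.5x slower"; the other entries are invented.
speedups = {
    "bench_a": 1.05,
    "bench_b": 0.98,
    "bench_c": 1.02,
    "mypy2": 0.40,
}

with_outlier = geometric_mean(list(speedups.values()))
without_outlier = geometric_mean(
    [value for name, value in speedups.items() if name != "mypy2"]
)

print(f"with mypy2:    {with_outlier:.3f}")
print(f"without mypy2: {without_outlier:.3f}")
```

With many more benchmarks in the real suite, the effect of one outlier is diluted, which fits the expectation of only a slight improvement.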

mdboom commented 5 months ago

Closing -- the above steps are all complete.