faster-cpython / ideas


Investigation: Are we using the right statistics to show improvement in our benchmarks? #688

Open mdboom opened 3 months ago

mdboom commented 3 months ago

Based on a conversation I had with @brandtbucher, I feel it's time to reinvestigate the various methods we use to arrive at an overall improvement number for our benchmarks. To summarize, we currently provide:

There are a few puzzling things:

We are now in a good position, with a lot of data collected over a long period. I should play with the different statistical methods we have and see which are genuinely the most valuable for the following goals (which may require different solutions):

a) understand if a change is helpful
b) show how far we've come
c) reduce measurement noise

brandtbucher commented 3 months ago

I think that a useful invariant would be that if we have two independent changes with "headline" numbers A and B vs some common base, then landing both changes should result in a new headline number that's equal to A * B, regardless of the "shape" of the results for each. I have a nagging worry that we might have statistical situations where two "one percent" improvements could combine to a one percent (or even a zero percent) improvement.
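For concreteness, here is a minimal sketch (all numbers invented) of one case where that invariant does hold exactly: if the headline is a geometric mean of per-benchmark speedups and the two changes act as independent per-benchmark factors, the combined headline is exactly A * B.

```python
# Sketch with invented numbers: when changes compose multiplicatively per
# benchmark, a geometric-mean headline composes multiplicatively too.
from math import prod

def geomean(values):
    return prod(values) ** (1 / len(values))

base = [100.0, 50.0, 200.0]        # hypothetical benchmark times (ms)
factors_a = [1.10, 1.02, 0.98]     # per-benchmark speedup factors of change A
factors_b = [1.01, 1.20, 1.05]     # per-benchmark speedup factors of change B

after_a = [t / f for t, f in zip(base, factors_a)]
after_b = [t / f for t, f in zip(base, factors_b)]
after_both = [t / (fa * fb) for t, fa, fb in zip(base, factors_a, factors_b)]

def headline(new_times):
    return geomean([b / n for b, n in zip(base, new_times)])

a, b, both = headline(after_a), headline(after_b), headline(after_both)
print(a * b, both)   # equal up to floating-point rounding
```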

mdboom commented 3 months ago

> I think that a useful invariant would be that if we have two independent changes with "headline" numbers A and B vs some common base, then landing both changes should result in a new headline number that's equal to A * B, regardless of the "shape" of the results for each.

I agree with what you are saying, but in practice it does seem like one change could hide in another, e.g. both changes create better cache locality, and when you put them together you don't get that win "twice". I think a looser invariant is that if A > 1 and B > 1, then the headline for landing both changes together is also > 1, and I'm not even sure we are currently meeting that basic invariant.
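As a toy illustration of that looser invariant (numbers invented): two changes whose wins overlap on the same benchmarks can combine to less than A * B while still staying above 1.

```python
# Toy example of overlapping wins: both changes mostly speed up benchmark 1
# (say, via better cache locality), so the combined headline is below A * B
# but still above 1. All numbers are invented.
from math import prod

def geomean(values):
    return prod(values) ** (1 / len(values))

speedups_a = [1.08, 1.00, 1.00]      # change A alone
speedups_b = [1.06, 1.01, 1.00]      # change B alone
speedups_both = [1.09, 1.01, 1.00]   # both landed: the shared win isn't doubled

a, b, both = geomean(speedups_a), geomean(speedups_b), geomean(speedups_both)
print(a, b, a * b, both)             # both < a * b, but both > 1
```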

mdboom commented 3 months ago

I created longitudinal plots that show 4 different aggregation methods together:

Here is that on 3.14.x with JIT against 3.13.0b3 (main Linux machine only):

[Figure: evaluate-314]

And here is that with the classic 3.11.x against 3.10.4 (main Linux machine only):

[Figure: evaluate-311]

It's nice to see that they are all more-or-less parallel with some offset, and while you can see HPT reducing variation (as designed), the other alternatives aren't uselessly noisy either. It's tempting to use "overall mean" because it's the most favourable, but that feels like cherry-picking.

We don't quite have all the data to measure Brandt's suggestion. However, we can test the following for each of the methods: for 2 adjacent commits A and B and a common base C, if B:A > 1, B:C must be > A:C. The only method where this doesn't hold true is the overall mean method.
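To make that check concrete, here is a toy version (invented numbers, not from the real data, and not meant to model the actual "overall mean" computation), assuming per-benchmark speedups multiply across commits. A geometric mean composes exactly, so it can never fail the check; a plain arithmetic mean of per-benchmark speedups can.

```python
# Toy consistency check: commits A and B against a common base C, with
# per-benchmark speedups that multiply (B:C == A:C * B:A per benchmark).
# The numbers are invented purely to show the failure mode.
from math import prod
from statistics import mean

def geomean(values):
    return prod(values) ** (1 / len(values))

a_vs_c = [10.0, 1.0]     # A's per-benchmark speedups over base C
b_vs_a = [0.5, 1.6]      # B's per-benchmark speedups over A
b_vs_c = [x * y for x, y in zip(a_vs_c, b_vs_a)]

# Arithmetic mean: B:A = 1.05 > 1, yet B:C = 3.3 < A:C = 5.5 -- check fails.
print(mean(b_vs_a), mean(a_vs_c), mean(b_vs_c))

# Geometric mean: B:C == A:C * B:A exactly, so B:A > 1 always implies B:C > A:C.
print(geomean(b_vs_c), geomean(a_vs_c) * geomean(b_vs_a))
```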

Lastly, I experimented with bringing the same nuance we have in the benchmarking plots to the longitudinal ones -- it's possible to show violin plots for each of the entries like this (again 3.11.x vs. 3.10.4):

[Figure: violins-311]

(Imagine the x axis is dates -- it's a little tricky to make that work...)

This plot is interesting because it clearly shows where the "mean" improvement is but also that there are a significant number of specific use cases where you can do much better than that -- I do sort of find it helpful to see that.
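For anyone wanting to play with the idea, here is a minimal matplotlib sketch with invented speedup distributions (the real plot would use the actual per-benchmark results, with dates on the x axis):

```python
# Minimal sketch of the violin-plot idea with invented data: one distribution
# of per-benchmark speedups per point in time, with a reference line at 1.0.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
labels = ["2022-01", "2022-04", "2022-07", "2022-10"]   # stand-ins for dates
speedups = [rng.lognormal(mean=np.log(m), sigma=0.15, size=60)
            for m in (1.05, 1.12, 1.20, 1.25)]          # invented distributions

fig, ax = plt.subplots()
ax.violinplot(speedups, positions=range(len(speedups)), showmedians=True)
ax.axhline(1.0, color="grey", linestyle="--", linewidth=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_ylabel("Speedup vs. base")
plt.show()
```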

Anyway, there are still more things to look at here -- just wanted to provide a braindump and get some feedback in the meantime.

mdboom commented 3 months ago

A ridgeline plot seems kind of useful for visualizing the improvement of Python 3.11 over 3.10:

[Figure: violins-311]