Closed: maximecb closed this issue 3 years ago
Okay. Unless I hear otherwise, I'll consider any single-file benchmark to be synthetic rather than real-world. If you'd like some other division, let me know.
I think we might have to manually tag the real-world benchmarks unfortunately, because optcarrot, nbody, fannkuch, binarytrees and lee also fit in the synthetic category.
If that's inconvenient then maybe we can just not report the single-file benchmarks on the front page and avoid labelling anything a real-world benchmark or a microbenchmark.
I can keep a hardcoded list. But then it will act like a hardcoded list and we'll have to periodically update it manually.
It would also be possible to add some kind of tagging in benchmark.rb. That would require adding something to the yjit-bench harness, I expect, since we'd want a method for tagging. The yjit-bench harness itself could just ignore it, naturally.
Adding a flag to the benchmark harness seems a bit strange since the benchmarking code itself doesn't / shouldn't care how we humans categorize the benchmarks. Maybe a hardcoded list is ok for now? It really comes down to how we want to display the data.
I'll start with a hardcoded list.
As far as putting it in the harness - if I were going with that approach, I'd actually put it in benchmark.rb, basically as some kind of metadata. It's the same general idea as how we pass a "recommended" number of iterations, usually ignored, for each benchmark - you store the various benchmark-specific information with the benchmark rather than in some external list, where possible. Best, of course, is to store it in an executable, actionable form where it's less likely to rot.
"Real-world vs synthetic" is certainly about how humans categorise the benchmarks, but it's also fairly specifically about each individual benchmark. The more of that you store in other random places, the more stuff you have to update in two+ places. But if you had e.g. a "metadata tags" call that would accept (and usually ignore) arbitrary data, it gives you a place to collect random things you might want to know about the benchmark, where they get read and touched when the benchmark does.
But for now I'll put a hardcoded list in yjit-metrics.
Hm. Here's an oddity: I would normally put optcarrot into "real" benchmarks. It's a headless NES emulator that correctly runs actual NES images. It may actually be the most real of our benchmarks, certainly more useful than Railsbench. Lee (a circuit-routing benchmark) is probably its best competition for "most useful yjit-bench benchmark". The other things I'd count as possibly real (activerecord, jekyll, liquid-render, mail, psych-load) are all simple library load-tests.
That makes our graphs look a lot worse, though, because MJIT is really well tuned for optcarrot. Here's what the real-only graph looks like on my machine:
[graph: real-world-only benchmark results]
It also messes with our speed-vs-MJIT number in the headline quite a lot. Optcarrot is a long-running benchmark and MJIT optimises it very well. Overall we come out 3.8% faster than MJIT on the latest real-world-only results. Depending on the run, it looks like we're 2% to 5% faster than MJIT when we calculate it that way. So that's what our front page will say if we merge this as-is.
This might be an argument for adding more Rails-based or ActiveRecord-based benchmarks, where YJIT has stronger performance vs MJIT.
Note to self: current work for this is in the separate_real_and_synth branch; supporting Jekyll changes are local in "pages".
Hm. I can weight the benchmark averages per-benchmark (scale to that specific benchmark's YJIT performance), which makes long-running benchmarks less influential, and that helps the headline problem. It brings our perf vs MJIT back to the same level, and improves our perf vs CRuby as well: "Overall YJIT is 29.4% faster than interpreted CRuby, or 11.5% faster than MJIT".
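Roughly, the difference between the two calculations looks like this - a sketch with invented runtimes, not real data:

```ruby
# Sketch of the two headline calculations; every number here is invented.
# Each value is a mean runtime in seconds for one benchmark.
cruby = { "railsbench" => 2.1, "optcarrot" => 38.0, "liquid-render" => 0.9 }
yjit  = { "railsbench" => 1.6, "optcarrot" => 30.0, "liquid-render" => 0.7 }

# Option 1: ratio of summed runtimes. Obvious interpretation, but a
# long-running benchmark like optcarrot dominates the result.
total_ratio = cruby.values.sum / yjit.values.sum

# Option 2: compute a per-benchmark speedup first, then combine the
# speedups, so each benchmark counts equally regardless of its runtime.
speedups = cruby.keys.map { |name| cruby[name] / yjit[name] }
per_bench_mean = speedups.sum / speedups.size  # (but see geometric mean below)

puts "summed-runtime ratio:       #{total_ratio.round(3)}"
puts "per-benchmark mean speedup: #{per_bench_mean.round(3)}"
```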
But the big eye-catching thing on the real-world benchmarks graph is still "MJIT is really fast on optcarrot". That might be fine.
Well, synthetic benchmark simply means that the benchmark was written by people for the purpose of being a benchmark. Optcarrot fits that definition IMO. It has been optimized to run fast on Ruby. Lee is Chris Seaton's electronic circuit routing benchmark. It's a nice piece of code and it has useful real-world value, but we don't want to have benchmarks written by Ruby compiler people in that set.
I would actually say railsbench is a pretty decent benchmark. The Rails framework itself wasn't written for the purpose of being a benchmark. It's very challenging to optimize, and the stats for it are pretty similar to what we see in production - much more similar than optcarrot's. Could it be better and more realistic? Definitely.
> Optcarrot is a long-running benchmark and MJIT optimises it very well. Overall we come out 3.8% faster than MJIT on the latest real-world-only results.
I don't think you should be weighting by how long the benchmarks run.
However, when averaging percentages, you should definitely be using the geometric mean rather than the arithmetic mean: https://www.jstor.org/stable/2276859 https://en.wikipedia.org/wiki/Geometric_mean
That's going to avoid over-representing very large values.
I'm currently not explicitly weighting by runtime - I'm just summing the runtimes, which implicitly weights larger numbers more highly. That has the advantage of a really obvious interpretation: "if I ran each of these benchmarks once on these two different Rubies, what would be the ratio of the two total runtimes?" But of course it weights long-running benchmarks more heavily, just as our actual benchmark runs do.
You're right, the arithmetic mean is a terrible idea. I was trying to remember the geometric mean and found the harmonic mean, which is a different wrong thing :-) I'm not sure how you'd explain the geometric mean's interpretation here. I'll see if I can find something more explanatory than "it's the geometric mean, you just multiply all the ratios and take the nth root."
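For reference, a sketch of the geometric-mean calculation - the speedup values are invented, and the interpretation in the comment is just one way to read it:

```ruby
# Geometric mean of per-benchmark speedups: multiply the ratios and take the
# nth root. Equivalently, exponentiate the mean of the logs, which behaves
# better numerically when there are many benchmarks.
# The speedup values below are invented, purely for illustration.
speedups = [1.31, 1.27, 1.05, 1.55]

geo_mean = Math.exp(speedups.sum { |r| Math.log(r) } / speedups.size)

puts "geometric mean speedup: #{geo_mean.round(3)}"

# One way to read it: the single uniform per-benchmark speedup that would
# give the same product of ratios as the real results - so one huge outlier
# can't drag the average the way it does with an arithmetic mean.
```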
All of these were, in a literal sense, written to be benchmarks. OptCarrot is unusual in having a real-world usage at all. But we won't have many "real-world" benchmarks if we disqualify lee and optcarrot.
30k_ifelse, 30k_methods, cfunc_itself, fib, getivar, setivar, respond_to - these are simple synthetic microbenchmarks
binarytrees, fannkuchredux, nbody - these are from a benchmark competition
activerecord, psych-load, mail - these are extremely simple call-in-a-loop load tests
(Leaving only optcarrot, lee, railsbench, liquid-render and jekyll. That's the whole list.)
So does "real-world" mean railsbench and liquid-render? Jekyll clearly has some kind of resource leak and I've been treating its results as dubious until I figure that out and (if possible) fix it. Liquid-render explicitly uses a "profiling" theme, but maybe that's fine.
I'm defining liquid, railsbench, mail, jekyll, activerecord and psych-load as "real-world" because, yes, we are calling something in a loop, but that code wasn't written by us - it comes from those software packages.
Yes, the lines are blurry and this is a bit of an arbitrary separation, but the code for optcarrot and lee is very much written in a style that has large methods and can be easier to optimize. Lee was written by Chris, and TruffleRuby performs better on it than on any of our other benchmarks, which is no surprise. I could write a benchmark in a style that I know YJIT is going to perform well at, and it might give results like the ones you see for 30k_ifelse. That's why I'm excluding myself from writing any of the benchmarks we classify as "real-world".
It comes down to which benchmarks we want to put forward and focus our optimization efforts on. Optcarrot hasn't really been an optimization target for us because ultimately it's not representative of real-world code and nobody really cares. It can be good for showing off but not that useful for 99% of the Ruby community.
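As a sketch of what that hardcoded list in yjit-metrics could look like - the constant name and helper are made up, and only the membership follows the categorisation above:

```ruby
# Hypothetical constant for yjit-metrics reporting code; the name and shape
# are invented, the membership follows the categorisation discussed above.
REAL_WORLD_BENCHMARKS = %w[
  railsbench liquid-render jekyll mail activerecord psych-load
].freeze

# Everything not on the list gets reported as a micro/synthetic benchmark.
def real_world?(bench_name)
  REAL_WORLD_BENCHMARKS.include?(bench_name)
end
```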
Looks very good Noah! Now we can clearly see the difference between those benchmarks.
We should separate out the microbenchmarks and synthetic benchmarks (e.g. the 30k_* benchmarks, cfunc_itself, fib, nbody, optcarrot, etc.) from the benchmarks based on real-world software and gems. Ideally, we would report the real-world benchmarks on a different graph and present that first (before the synthetic benchmarks graph).
The synthetic benchmarks are potentially useful indicators and debugging tools, but IMO they make YJIT look a bit less serious and they distort the Y axis. We should also compute the percentage of how much faster YJIT is than CRuby using only "real-world" benchmarks.