JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Revive performance regression testing? #11709

Closed johnmyleswhite closed 8 years ago

johnmyleswhite commented 9 years ago

As discussed in #11683, it would be great to revive some kind of performance regression testing so that we don't slowly accumulate perf problems.

In an ideal world, we'd standardize on a system that non-Base packages could use as well. But for now it would be good to just get Base tested rigorously for perf again.

carnaval commented 9 years ago

Here are some things about my dream benchmark suite (maybe too much work, but one can hope):

To summarize, we need to have something very robust that we can trust. I don't want to end up with what we see right now, where everyone handwaves about "unrelated" CI crashes, which then makes us introduce actual bugs that should have been caught. It is starting to irritate me a bit :-)

yuyichao commented 9 years ago

@carnaval What about the Phoronix Test Suite? I've never used it myself, but it seems to be pretty good at comparing two things (commits), and they are pretty serious about it (so they are probably doing the statistics right).

A timeline like what codespeed provides would still be nice though.

KristofferC commented 9 years ago

Couldn't a first step be to simply revive http://speed.julialang.org? Was the problem a lack of finances for hardware, or was there significant manual work needed to keep the site up? I think it is easier to get people to write performance tests if they have some pretty graphs/numbers to look at. Getting something decent going is good enough while being on the way to the "performance testing nirvana" @carnaval mentioned.

johnmyleswhite commented 9 years ago

All great points. My take on each issue:

tkelman commented 9 years ago

+1 for porting Criterion, have heard nothing but praise for that library.

Based on @KristofferC's comments here https://github.com/JuliaLang/julia/pull/11683#issuecomment-111875364 I will retract my suggestion of looking into CDash. It looked like it could save some work by just providing a place to put large amounts of historical test data, but if it's a Jenkins-style quagmire of configuration files to set up, then never mind. It also looks like they only store 30 days' worth of data on free plans.

This jogged my memory, there was a past issue on this at #9456 which was closed without any particular resolution.

johnmyleswhite commented 9 years ago

I've got about half of the work done to get something inspired by Criterion finished. Should have final results tomorrow night.

IainNZ commented 9 years ago

Speed.julialang.org was @staticfloat's baby. At the time we didn't have the build infrastructure we have now, so I think because it was trying to run on every commit and was also dealing with build issues, there were a lot of problems. Many of the perf tests didn't seem to be set up to run long enough either, creating some spurious results. It'd probably be easier to get something working now (using e.g. generic nightly Linux binaries, like PackageEvaluator does).

IainNZ commented 9 years ago

Would love to write some perf tests too, especially for natural-but-not-highest-possible-perf code (the kind you see on Stackoverflow). Only worth it if we're tracking them.

tkelman commented 9 years ago

On the infrastructure side of plumbing up the GitHub webhooks, we should probably borrow some tools from Rust and use https://github.com/barosl/homu to help address https://github.com/staticfloat/julia-buildbot/issues/1. The buildbots we have at http://buildbot.e.ip.saba.us:8010/builders run surprisingly often on master. To start with, we should work out a method of pinging a @juliabot (looks like that's taken but inactive; maybe @juliabuild?) in the comments of a PR to fire off perf and/or PkgEvaluator runs on demand.

IainNZ commented 9 years ago

IIRC @Keno owns @JuliaBot

IainNZ commented 9 years ago

And I was messing around with this: https://github.com/MechaJulia/MechaJulia for a while

Keno commented 9 years ago

I believe I have @julia-bot with a dash.

tkelman commented 9 years ago

MechaJulia's not a bad name, but dash vs no dash would get misspelled way too often.

Realistically buildbot and other projects have already done a lot of work here, there are plenty of off-the-shelf solutions we can start from. The inspired-by-criterion benchmarking harness should probably be written in Julia, but most of the other infrastructure shouldn't IMO.

johnmyleswhite commented 9 years ago

Totally agree with that perspective

johnmyleswhite commented 9 years ago

Didn't manage to finish this work tonight, but the core ideas are in place at https://github.com/johnmyleswhite/Benchmarks.jl

I'd like to make it run a lot faster, but it currently does satisfy my core design constraints:

There's still a lot more work to make it nice to use and, more importantly, to produce correct confidence intervals, but I'm hopeful this design should improve upon both the Benchmark.jl and BenchmarkLite.jl packages.
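For illustration only, here is a minimal sketch of the Criterion-style idea (not the actual Benchmarks.jl API): time the expression for a geometrically growing number of inner evaluations, then take the slope of elapsed time against evaluation count as the per-evaluation estimate, so that constant measurement overhead lands in the intercept.

```julia
# Minimal sketch of the Criterion-style approach (hypothetical, not the
# Benchmarks.jl API): estimate per-evaluation time as the slope of a
# least-squares fit of elapsed time against evaluation count.
function rough_estimate(f; budget_s = 1.0)
    ns = Float64[]   # evaluations per sample
    ts = Float64[]   # elapsed seconds per sample
    n, spent = 1, 0.0
    while spent < budget_s || length(ns) < 2
        t0 = time_ns()
        for _ in 1:n
            f()
        end
        elapsed = (time_ns() - t0) / 1e9
        push!(ns, n); push!(ts, elapsed)
        spent += elapsed
        n = ceil(Int, 1.1n)   # grow the evaluation count geometrically
    end
    X = hcat(ones(length(ns)), ns)
    intercept, slope = X \ ts   # slope ≈ seconds per evaluation
    return slope
end

rough_estimate(() -> sum(rand(1000)))   # ≈ seconds per call
```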

staticfloat commented 9 years ago

I'm late to the party! Here were/are the challenges with speed.julialang.org:

johnmyleswhite commented 9 years ago

Yeah, if you're willing to dedicate a machine, that would be awesome. My personal feeling is that doing testing and benchmarking right is one of the best ways that Julia could spend the money that's donated via NumFocus, so I'd like to think we could acquire at least one dedicated machine for Julia as an institution.

I think using codespeed is better than nothing. I guess my question is: do you think their data schema is bad or just their interface? I would be just as happy throwing up a few trivial static graphs that would be easy to generate. The interactive interface doesn't interest me that much.

staticfloat commented 9 years ago

Let's settle on the metrics we want to report, and then evaluate whether codespeed in its current incarnation will serve the purpose properly.

IainNZ commented 9 years ago

@johnmyleswhite, I'd definitely be in favor of static graphs, to keep the maintenance load on any one person as light as possible. If someone wants to help out with Codespeed, there is going to be a fixed cost to get oriented that might not be worth it. Just a thought.

staticfloat commented 9 years ago

Perhaps we can make life easy for ourselves by having a simple webpage with, for example, Gadfly plots embedded that show the general trends of things, and then just have the full data for the last N days downloadable as .jld files or something, in case someone wants to get into the nitty-gritty? The philosophy I'd like to have with this is minimal maintenance, but not sacrificing the ability to investigate what exactly is going on.

The reason I mention Gadfly is because it is really, really nice to be able to hover over the first datapoint that shows a regression and see what commit hash that was.
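As an illustration (hypothetical data and column names, not the real pipeline), a static trend page could be generated from something as simple as:

```julia
using Dates, DataFrames, Gadfly

# Hypothetical per-commit timing data; in practice this would be keyed by
# commit hash and date from the benchmark runs.
timings = DataFrame(
    date      = repeat(collect(Date(2015, 6, 1):Day(1):Date(2015, 6, 10)), outer = 2),
    seconds   = vcat(0.90 .+ 0.02 .* randn(10), 1.40 .+ 0.03 .* randn(10)),
    benchmark = repeat(["fib", "parse_int"], inner = 10),
)

p = plot(timings, x = :date, y = :seconds, color = :benchmark,
         Geom.line, Geom.point)
draw(SVG("trends.svg", 20cm, 10cm), p)
```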

johnmyleswhite commented 9 years ago

I'm on board with all of that.

IainNZ commented 9 years ago

That's the http://pkg.julialang.org/pulse.html philosophy! :+1:

staticfloat commented 9 years ago

I just took a look at how pkg.julialang.org is run. That is kind of hilarious.

IainNZ commented 9 years ago

Indeed :D For those who don't know, the process is:

1) Cron job launches 4 virtual machines (via Vagrant) on @mlubin's box in our office.
2) Madness happens, JSONs are created.
3) I manually run a script to scp those files to my local machine, bash the JSONs around some more, and enhance them with more info from GitHub. This includes enlarging the test history database (a text file) and updating the badges (one file per Julia version per package).
4) I check, run git commit, push. Website is updated.
5) Optional: I run another script that goes through test status degradations and lets me file an issue if I think it's appropriate.


kshyatt commented 9 years ago

Can I get more details on step 2, please?

IainNZ commented 9 years ago

Essentially, runs this script: https://github.com/IainNZ/PackageEvaluator.jl/blob/master/scripts/setup.sh to get the VM set up. Then adds the packages, runs code to detect the license and some other stuff, then attempts to find and run the package's tests.

That last bit is a lot of the madness, and it was from before Pkg.test even existed. I'm actually going to rip that out - things should be a lot simpler, but a decent number of packages will shift from "green" to "blue", because Pkg.test won't find anything.
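For context, Pkg.test just looks for and runs the package's test/runtests.jl. A minimal, hypothetical example of such a file (MyPackage and my_function are made up):

```julia
# test/runtests.jl -- what Pkg.test executes for a package.
using Test
using MyPackage   # hypothetical package under test

@testset "MyPackage" begin
    @test my_function(2) == 4   # hypothetical function
end
```

Running `Pkg.test("MyPackage")` then executes exactly that file.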

jakebolewski commented 9 years ago

I like the simplicity of Chapel's simple performance tracking website (http://chapel.sourceforge.net/perf/chap04/). I feel something similar would be more than enough for our needs at this time.

IainNZ commented 9 years ago

Somewhat on-topic, the choice of mean for measuring relative performance is a pretty interesting topic (to me at least); here is one paper: http://mprc.pku.cn/courses/organization/spring2008/PaperReading/Performance%20and%20Evaluation/Characterizing%20computer%20performance%20with%20a%20single%20number.pdf

I see people comparing relative performance with arithmetic means worryingly often...

johnmyleswhite commented 9 years ago

I would argue that the arithmetic mean is generally what you want since it's the only number that trivially extrapolates to the long-term cost of evaluating something repeatedly in production. Cf. Radford Neal's disapproval of the use of medians in one R microbenchmarking paper: https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/

In an ideal world, you'd want to compare the full probability distributions of timings for stochastic dominance, but that's rarely viable.

FWIW, I think the main problem with relative performance via arithmetic means isn't a problem with arithmetic means per se, but with the lack of application of a proper ratio estimator: https://en.wikipedia.org/wiki/Ratio_estimator
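To make the distinction concrete (made-up timings, not a real benchmark), compare the naive mean of per-sample ratios with the classical ratio-of-means estimator:

```julia
using Statistics

# Hypothetical per-sample timings (seconds) of one benchmark on two builds.
old = 1.0 .+ 0.1 .* abs.(randn(100))
new = 0.8 .+ 0.1 .* abs.(randn(100))

# Naive: average the per-sample ratios (each sample weighted equally,
# regardless of how long it actually took).
mean_of_ratios = mean(new ./ old)

# Ratio estimator: divide the sample means, which estimates the long-run
# total cost of `new` relative to `old`.
ratio_of_means = mean(new) / mean(old)
```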

IainNZ commented 9 years ago

Arithmetic mean over samples on a single benchmark, sure. I meant, compare Julia 0.4 vs Julia 0.3 on a suite of metrics. Taking arithmetic mean for average speed up of Julia 0.4 over 0.3 on a suite of benchmarks wouldn't be right. That ratio estimator thing is neat, need to look at that.

johnmyleswhite commented 9 years ago

I think I'm confused. You're taking arithmetic mean over a suite of benchmarks where speedups are measured in absolute units or percent change?

IainNZ commented 9 years ago

Say I have 10 benchmarks, and let's just say I take the total time for Julia 0.3 and Julia 0.4 on those benchmarks, totalled over 100 samples. For each benchmark I have the ratio J4 / J3, which (hopefully) measures how much faster Julia 0.4 is on that benchmark. To get a single number for the performance improvement, the arithmetic mean of those ratios would not be a good idea (from my understanding), and can give some non-intuitive results that don't really capture what was intended. There seems to be debate over whether the geometric mean or the harmonic mean is more appropriate.

IainNZ commented 9 years ago

So percentage change, not absolute

johnmyleswhite commented 9 years ago

Isn't the arithmetic mean over all of the ratios just some scalarized attempt to solve a multi-objective optimization problem -- and hence one that at least guarantees that your induced objective function would always lead to Pareto improvements? I need to read that paper in depth, but I find it hard to imagine a priori that geometric means or harmonic means are a big improvement when the core difficulty is that you can't know what the relative importance of each benchmark is.

ScottPJones commented 9 years ago

That article from Radford Neal is interesting, but it seems to me that the reason he prefers the mean is that the system has a lot of hidden costs (garbage collection) that cause large performance drops on some of the samples, and that needs to be taken into account. I ran into issues benchmarking where looking at the median and MAD (median absolute deviation) worked out better than the arithmetic mean and standard deviation, to avoid counting external slowdowns.
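A sketch of that robust summary (hypothetical samples where a few runs were hit by an external slowdown; `mad` is from StatsBase):

```julia
using Statistics
using StatsBase: mad

# 95 "clean" samples plus 5 that caught an external slowdown (or a GC pause).
samples = vcat(1.00 .+ 0.01 .* randn(95), 3.0 .+ 0.5 .* randn(5))

mean(samples), std(samples)                       # pulled upward by the slow outliers
median(samples), mad(samples; normalize = true)   # largely insensitive to them
```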

ScottPJones commented 9 years ago

I agree with @johnmyleswhite (at least about the parts I understand!): you can't just take a mean over different benchmark percentages or ratios; the number you get is meaningless unless you have different sets of weightings for the results, to match how much those operations are used in different use cases.

IainNZ commented 9 years ago

@johnmyleswhite that's an interesting perspective... I guess one thing that is clear is that the arithmetic mean can't be inherently wrong without some definition of what wrong (or right) means. I've basically seen and done some playing around myself and found that the arithmetic mean did some fairly unintuitive things when the ratios are not uniformly >= or <= 1, and that the geometric mean was a better way to capture it in a single number - I think that's what these papers are getting at. Would be interesting to formalize it, though.

johnmyleswhite commented 9 years ago

Still more work to be done, but I think the basic design of the Criterion-style benchmarking library is close to finished: https://github.com/johnmyleswhite/Benchmarks.jl/

nalimilan commented 9 years ago

@IainNZ I think you really want to use (possibly weighted) geometric means when you're working with ratios. If on one machine 0.4 ran twice faster than 0.3, and another one twice slower, the mean should be sqrt(1/2 * 2) = 1, not (1/2 + 2)/2 = 1.25.
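In Julia terms (`geomean` is from StatsBase), that example works out as:

```julia
using Statistics
using StatsBase: geomean

# Hypothetical time ratios of 0.4 relative to 0.3: twice as fast on one
# benchmark, twice as slow on another.
ratios = [0.5, 2.0]

mean(ratios)      # 1.25 -- suggests a spurious 25% average slowdown
geomean(ratios)   # 1.0  -- the two effects cancel, as they should
```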

GlenHertz commented 9 years ago

To backup what @nalimilan said, search for the paper "How not to lie with statistics: the correct way to summarize benchmark results".

pao commented 9 years ago

This being the internet, we can also just link to things: Fleming1986 is the paper in question.

ScottPJones commented 9 years ago

Just to drive you all a bit crazy :grinning:, have you thought of multiprocess benchmarking? Most all of my benchmarking experience was in benchmarking operations using all of the cores, because single process benchmarking doesn't give a good indicator of systemwide performance. Let's say you have algorithm A, which is 2x as fast as algorithm B, when run on a single process, but it uses 64K of temp array space... whereas algorithm B requires only 2K... are you really sure that A is going to do better when you have 1000 processes on a 32 core machine? Usage of your L1/L2 and even L3 caches becomes critical... cache line ping-ponging can also make or break scalability...

IainNZ commented 9 years ago

Ahh @GlenHertz @pao, that was the paper I was thinking of!

@nalimilan yes that was what I was saying.

johnmyleswhite commented 9 years ago

FWIW, here's how I think one should handle these kinds of problems: http://eytan.github.io/benchmarking.pdf

tkelman commented 9 years ago

At least in operations research (surprised Iain hasn't mentioned this yet), the standard presentation format for benchmarks over large test sets with a variety of large and small problem instances is Dolan-Moré performance profiles: http://arxiv.org/pdf/cs/0102001.pdf

That format of presenting results is much more valuable when you're comparing more than 2 alternatives though.
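A rough sketch of the idea, with a made-up timing matrix: for each solver s, ρ_s(τ) is the fraction of problems it handles within a factor τ of the best solver on that problem.

```julia
# t[i, s] = time of solver/configuration s on problem i (made-up numbers).
t = [1.0 1.2 0.9;
     3.0 2.5 4.0;
     0.5 0.6 0.55]

# Per-problem ratios relative to the best solver on that problem.
r = t ./ minimum(t, dims = 2)

# ρ_s(τ): fraction of problems solver s handles within a factor τ of the best.
ρ(s, τ) = count(r[:, s] .<= τ) / size(r, 1)

ρ(1, 1.0)   # fraction of problems where solver 1 is (tied for) fastest
ρ(2, 1.5)   # fraction where solver 2 is within 50% of the best
```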

IainNZ commented 9 years ago

Oh man how did I forget that! Seriously, performance profiles are some interesting stuff once you wrap your head around them.

mlhetland commented 9 years ago

Just to add to the confusion: Citron et al., in their The harmonic or geometric mean: does it really matter?, have an interesting historical timeline of “The War of the Means,” which seems to have been going on for a while in the hardware community. I also kinda like Bast and Weber's Don't Compare Averages.

I think Eugster, Hothorn and Leisch have some interesting solutions in their Exploratory and Inferential Analysis of Benchmark Experiments, which they have implemented in the R package benchmark. Many of these solutions avoid the use of averages by summarizing all the data in other ways, e.g., histograms of how often each of the compared methods lands at each ranking, from first to last, or by constructing a consensus ranking across the various benchmarks (although that raises the question of how to weight them in constructing the consensus, I guess).

BTW: Kudos & thanks to @johnmyleswhite for Benchmarks.jl! Always nice to cross stuff out on my wishlist :smile_cat: :gift:

jrevels commented 8 years ago

see #13893

tkelman commented 8 years ago

To repeat what others have said, you've done a really impressive job putting this together @jrevels. :applause:

jrevels commented 8 years ago

Thanks!