Here are some things about my dream benchmark suite (maybe too much work, but one can hope):
To summarize, we need to have something very robust that we can trust. I don't want to have what we can see right now with everyone handwaving "unrelated" CI crashes which then makes us introduce actual bugs that should have been caught. It is starting to irritate me a bit :-)
@carnaval What about the Phoronix Test Suite? I've never used it myself, but it seems to be pretty good at comparing two things (commits), and they are pretty serious about it (so they are probably doing the statistics right).
A timeline like what codespeed provides would still be nice though.
Couldn't a first step be to simply revive http://speed.julialang.org? Was the problem a lack of finances for hardware, or was there significant manual work needed to keep the site up? I think it is easier to get people to write performance tests if they have some pretty graphs/numbers to look at. Getting something decent going is good enough while being on the way to the "performance testing nirvana" @carnaval mentioned.
All great points. My take on each issue:
+1 for porting Criterion, have heard nothing but praise for that library.
Based on @KristofferC's comments here https://github.com/JuliaLang/julia/pull/11683#issuecomment-111875364 I will retract my suggestion of looking into CDash, it looked like it could save some work in just having a place to put large amounts of historical test data, but if it's a Jenkins-style quagmire of configuration files to set up then nevermind. Also looks like they only store 30 days worth of data on free plans.
This jogged my memory, there was a past issue on this at #9456 which was closed without any particular resolution.
I've got about half of the work done to get something inspired by Criterion finished. Should have final results tomorrow night.
Speed.julialang.org was @staticfloat's baby. At the time we didn't have the build infrastructure we have now, so because it was trying to run on every commit and was also dealing with build issues, there were a lot of problems. Many of the perf tests also didn't seem to be set up to run long enough, which created some spurious results. It'd probably be easier to get something working now (using e.g. generic nightly Linux binaries, like PackageEvaluator does).
Would love to write some perf tests too, especially for natural-but-not-highest-possible-perf code (the kind you see on Stackoverflow). Only worth it if we're tracking them.
On the infrastructure side of plumbing up the GitHub webhooks, we should probably borrow some tools from Rust and use https://github.com/barosl/homu to help address https://github.com/staticfloat/julia-buildbot/issues/1. The buildbots we have at http://buildbot.e.ip.saba.us:8010/builders run surprisingly often on master. To start with, we should work out a method of pinging a @juliabot
(looks like that's taken but inactive, maybe @juliabuild?) in the comments of a PR to fire off perf and/or PkgEvaluator runs on-demand.
IIRC @Keno owns @JuliaBot
And I was messing around with this: https://github.com/MechaJulia/MechaJulia for a while
I believe I have @julia-bot with a dash.
MechaJulia's not a bad name, but dash vs no dash would get misspelled way too often.
Realistically buildbot and other projects have already done a lot of work here, there are plenty of off-the-shelf solutions we can start from. The inspired-by-criterion benchmarking harness should probably be written in Julia, but most of the other infrastructure shouldn't IMO.
Totally agree with that perspective
Didn't manage to finish this work tonight, but the core ideas are in place at https://github.com/johnmyleswhite/Benchmarks.jl
I'd like to make it run a lot faster, but it currently does satisfy my core design constraints, including reducing to the behavior of the @time macro when that macro is appropriate. There's still a lot more work to make it nice to use and, more importantly, to produce correct confidence intervals, but I'm hopeful this design should improve upon both the Benchmark.jl and BenchmarkLite.jl packages.
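For readers who haven't seen Criterion, here is a minimal, self-contained sketch of the core idea (this is not Benchmarks.jl's actual API; the function name and defaults are made up): evaluate the expression in geometrically growing batches so each timing sample sits well above the clock's resolution, then estimate the per-evaluation cost from the accumulated (batch size, time) pairs.

```julia
# A minimal sketch of the Criterion-style idea; the function name and defaults
# here are hypothetical, not Benchmarks.jl's actual API.
function sample_benchmark(f; budget_s = 0.5)
    samples = Tuple{Int,Float64}[]
    evals = 1
    t_start = time()
    while time() - t_start < budget_s
        t = @elapsed for _ in 1:evals
            f()
        end
        push!(samples, (evals, t))
        evals = ceil(Int, evals * 1.1)   # grow the batch size geometrically
    end
    # Estimate seconds per evaluation as the least-squares slope (through the
    # origin) of elapsed time against batch size.
    xs = [s[1] for s in samples]
    ys = [s[2] for s in samples]
    return sum(xs .* ys) / sum(xs .* xs)
end

sample_benchmark(() -> sum(rand(1000)))   # ≈ seconds per call
```

The package itself handles far more detail than this, e.g. the confidence intervals mentioned above; the batching is just the part that distinguishes the approach from a bare @time.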
I'm late to the party! Here were/are the challenges with speed.julialang.org:
Yeah, if you're willing to dedicate a machine, that would be awesome. My personal feeling is that doing testing and benchmarking right is one of the best ways that Julia could spend the money that's donated via NumFocus, so I'd like to think we could acquire at least one dedicated machine for Julia as an institution.
I think using codespeed is better than nothing. I guess my question is: do you think their data schema is bad or just their interface? I would be just as happy throwing up a few trivial static graphs that would be easy to generate. The interactive interface doesn't interest me that much.
Let's settle on the metrics we want to report, and then evaluate whether codespeed in its current incarnation will serve the purpose properly.
@johnmyleswhite, I'd definitely be in favor of static graphs, to keep the maintenance load on any one person as light as possible. If someone wants to help out with Codespeed, there is going to be a fixed cost to get oriented that might not be worth it. Just a thought.
Perhaps we can make life easy for ourselves by having a simple webpage with, for example, Gadfly plots embedded that show the general trends of things, and then just have the full data for the last N days downloadable as .jld files or something, in case someone wants to get into the nitty-gritty? The philosophy I'd like to have with this is minimal maintenance, but without sacrificing the ability to investigate what exactly is going on.
The reason I mention Gadfly is because it is really, really nice to be able to hover over the first datapoint that shows a regression and see what commit hash that was.
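A rough sketch of what such a page's plotting script might look like (the column names, data, and file names below are all made up):

```julia
# Hypothetical plotting script for a static trend page.
using Gadfly, DataFrames

df = DataFrame(build   = [1, 2, 3, 4],
               commit  = ["a1b2c3d", "d4e5f6a", "0badf00", "c0ffee1"],
               seconds = [0.41, 0.40, 0.63, 0.62])

p = plot(df, x = :build, y = :seconds, label = :commit,
         Geom.line, Geom.point, Geom.label,
         Guide.xlabel("build"), Guide.ylabel("seconds"))

draw(SVG("perf_trend.svg", 10inch, 4inch), p)   # embed this SVG in the status page
# The raw history could be dumped alongside it, e.g. save("perf_history.jld", "df", df)
# with the JLD package, for anyone who wants the nitty-gritty.
```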
I'm on board with all of that.
That's the http://pkg.julialang.org/pulse.html philosophy! :+1:
I just took a look at how pkg.julialang.org is run. That is kind of hilarious.
Indeed :D For those who don't know, the process is:
1) A cron job launches 4 virtual machines (via Vagrant) on @mlubin's box in our office.
2) Madness happens; JSONs are created.
3) I manually run a script to scp those files to my local machine, bash the JSONs around some more, and enhance them with more info from GitHub. This includes enlarging the test history database (a text file) and updating the badges (one file per Julia version per package).
4) I check everything, run git commit, and push. The website is updated.
5) Optional: I run another script that goes through test status degradations and lets me file an issue if I think it's appropriate.
Can I get more details on step 2, please?
Essentially, runs this script: https://github.com/IainNZ/PackageEvaluator.jl/blob/master/scripts/setup.sh to get the VM set up. Then adds the packages, runs code to detect the license and some other stuff, then attempts to find and run the package's tests.
That last bit is a lot of the madness, and it was from before Pkg.test even existed. I'm actually going to rip that out - things should be a lot simpler, but a decent number of packages will shift from "green" to "blue", because Pkg.test won't find anything.
I like the simplicity of Chapel's simple performance tracking website (http://chapel.sourceforge.net/perf/chap04/). I feel something similar would be more than enough for our needs at this time.
Somewhat on-topic, the choice of mean for measuring relative performance is a pretty interesting topic (to me at least), here is one paper: http://mprc.pku.cn/courses/organization/spring2008/PaperReading/Performance%20and%20Evaluation/Characterizing%20computer%20performance%20with%20a%20single%20number.pdf
I see people comparing relative performance with arithmetic means worryingly often...
I would argue that the arithmetic mean is generally what you want since it's the only number that trivially extrapolates to the long-term cost of evaluating something repeatedly in production. Cf. Radford Neal's disapproval of the use of medians in one R microbenchmarking paper: https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
In an ideal world, you'd want to compare the full probability distributions of timings for stochastic dominance, but that's rarely viable.
FWIW, I think the main problem with relative performance via arithmetic means isn't a problem with arithmetic means per se, but with the lack of application of a proper ratio estimator: https://en.wikipedia.org/wiki/Ratio_estimator
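A toy illustration of the difference (numbers invented): with skewed timings, the naive mean of per-benchmark ratios and a simple ratio-of-totals estimator can tell opposite stories about whether things got faster overall.

```julia
# Made-up per-benchmark times (seconds) under an old and a new build.
old = [1.0, 10.0]
new = [2.0,  5.0]

mean_of_ratios  = sum(new ./ old) / length(old)  # 1.25  -> "25% slower on average"?
ratio_of_totals = sum(new) / sum(old)            # ~0.64 -> total time actually fell ~36%
```

The second number is the one that extrapolates to the long-run cost of a fixed workload mix, which is the sense in which the arithmetic-mean argument above is made.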
Arithmetic mean over samples on a single benchmark, sure. I meant comparing Julia 0.4 vs Julia 0.3 on a suite of metrics. Taking the arithmetic mean for the average speedup of Julia 0.4 over 0.3 on a suite of benchmarks wouldn't be right. That ratio estimator thing is neat; I need to look at that.
I think I'm confused. You're taking arithmetic mean over a suite of benchmarks where speedups are measured in absolute units or percent change?
Like I have 10 benchmarks, and let's just say total time for Julia 0.3 and Julia 0.4 on those benchmarks, totaled over 100 samples. For each benchmark I have the ratio J4 / J3, which (hopefully) measures how much faster Julia 0.4 is on that benchmark. To get a single number for the performance improvement, the arithmetic mean of those ratios would not be a good idea (from my understanding), and can give some non-intuitive results that don't really capture what was intended. There seems to be debate over whether the geometric mean or harmonic mean is more appropriate.
So percentage change, not absolute
Isn't the arithmetic mean over all of the ratios just some scalarized attempt to solve a multi-objective optimization problem -- and hence one that at least guarantees that your induced objective function would always lead to Pareto improvements? I need to read that paper in depth, but I find it hard to imagine a priori that geometric means or harmonic means are a big improvement when the core difficulty is that you can't know what the relative importance of each benchmark is.
That article from Radford Neal is interesting, but it seems to me that the reason he prefers the mean is that the system has a lot of hidden costs (garbage collection) that cause large performance drops on some of the samples, and that needs to be taken into account. I ran into issues benchmarking where looking at the median and MAD (median absolute deviation) worked out better than the arithmetic mean and standard deviation, to avoid counting external slowdowns.
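A minimal example of that robust summary (timings invented; median/mean/std live in the Statistics standard library on current Julia, and in Base on older versions):

```julia
using Statistics

times = [1.02, 0.99, 1.01, 1.00, 7.80, 1.03]  # one sample hit by an external slowdown

med = median(times)                 # 1.015  -- barely notices the outlier
mad = median(abs.(times .- med))    # 0.015  -- robust spread estimate

mean(times), std(times)             # ≈ (2.14, 2.77) -- both dragged up by the 7.8 s sample
```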
I agree with @johnmyleswhite (at least about the parts I understand!): you can't just take a mean of different benchmark percentages or ratios; the number you get is meaningless unless you have different sets of weightings for the results, to match how much those operations are used in different use cases.
@johnmyleswhite that's an interesting perspective... I guess one thing that is clear is that the arithmetic mean isn't inherently wrong without some definition of what wrong (or right) is. I've basically seen and done some playing around myself and found that the arithmetic mean did some fairly unintuitive things when the ratios are not uniformly >= or <= 1, and that the geometric mean was a better way to capture it in a single number - I think that's what these papers are getting at. Would be interesting to formalize it though.
Still more work to be done, but I think the basic design of the Criterion-style benchmarking library is close to finished: https://github.com/johnmyleswhite/Benchmarks.jl/
@IainNZ I think you really want to use (possibly weighted) geometric means when you're working with ratios. If on one machine 0.4 ran twice as fast as 0.3, and on another one twice as slow, the mean should be sqrt(1/2 * 2) = 1, not (1/2 + 2)/2 = 1.25.
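Spelled out as a quick check (self-contained toy numbers):

```julia
# Speedup ratios (0.4 time / 0.3 time): twice as fast on one machine, twice as slow on the other.
ratios = [0.5, 2.0]

arithmetic = sum(ratios) / length(ratios)        # 1.25 -- suggests a net slowdown
geometric  = prod(ratios)^(1 / length(ratios))   # 1.0  -- no net change, as argued above
```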
To back up what @nalimilan said, search for the paper "How not to lie with statistics: the correct way to summarize benchmark results".
This being the internet, we can also just link to things: Fleming1986 is the paper in question.
Just to drive you all a bit crazy :grinning:, have you thought of multiprocess benchmarking? Almost all of my benchmarking experience was in benchmarking operations using all of the cores, because single-process benchmarking doesn't give a good indicator of system-wide performance. Let's say you have algorithm A, which is 2x as fast as algorithm B when run on a single process, but it uses 64K of temp array space, whereas algorithm B requires only 2K... are you really sure that A is going to do better when you have 1000 processes on a 32-core machine? Usage of your L1/L2 and even L3 caches becomes critical... cache-line ping-ponging can also make or break scalability...
Ahh @GlenHertz @pao, that was the paper I was thinking of!
@nalimilan yes that was what I was saying.
FWIW, here's how I think one should handle these kinds of problems: http://eytan.github.io/benchmarking.pdf
At least in operations research (surprised Iain hasn't mentioned this yet) the standard presentation format for benchmarks over large test sets with a variety of large and small problem instances is Dolan-More performance profiles: http://arxiv.org/pdf/cs/0102001.pdf
That format of presenting results is much more valuable when you're comparing more than 2 alternatives though.
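For anyone who hasn't run into them, a rough sketch (toy timing matrix) of how the profile curves are computed: each solver's curve rho_s(tau) is the fraction of problems it solves within a factor tau of the best time on that problem.

```julia
# times[p, s]: time for solver/build s on problem p (made-up numbers).
times = [ 1.0  1.2;
          3.0  2.0;
         10.0  4.0]

best = minimum(times, dims = 2)        # best time per problem
r    = times ./ best                   # performance ratios r[p, s] >= 1

# rho(s, tau): fraction of problems where solver s is within a factor tau of the best.
rho(s, tau) = count(r[:, s] .<= tau) / size(r, 1)

[rho(s, tau) for tau in (1.0, 1.5, 2.0, 2.5), s in 1:2]  # one column per solver
```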
Oh man how did I forget that! Seriously, performance profiles are some interesting stuff once you wrap your head around them.
Just to add to the confusion: Citron et al., in their The harmonic or geometric mean: does it really matter?, have an interesting historical timeline of “The War of the Means,” which seems to have been going on for a while in the hardware community. I also kinda like Bast and Weber's Don't Compare Averages.
I think Eugster, Hothorn and Leisch have some interesting solutions in their Exploratory and Inferential Analysis of Benchmark Experiments, which they have implemented in the R package benchmark. Many of these solutions avoid the use of averages by summarizing all the data in other ways, e.g., histograms of the occurrence of each of the compared methods at each ranking, from first to last, or by constructing a consensus ranking across the various benchmarks (although that raises the question of how to weight them in constructing the consensus, I guess).
BTW: Kudos & thanks to @johnmyleswhite for Benchmarks.jl! Always nice to cross stuff out on my wishlist :smile_cat: :gift:
see #13893
To repeat what others have said, you've done a really impressive job putting this together @jrevels. :applause:
Thanks!
As discussed in #11683, it would be great to revive some kind of performance regression testing so that we don't slowly accumulate perf problems.
In an ideal world, we'd standardize on a system that non-Base packages could use as well. But for now it would be good to just get Base being tested for perf rigorously again.