Here are some things about my dream benchmark suite (maybe too much work, but one can hope):
To summarize, we need to have something very robust that we can trust. I don't want to have what we can see right now with everyone handwaving "unrelated" CI crashes which then makes us introduce actual bugs that should have been caught. It is starting to irritate me a bit :-)
@carnaval What about the Phoronix Test Suite? I've never used it myself, but it seems to be pretty good at comparing two things (commits), and they are pretty serious about it (so they are probably doing the statistics right).
A timeline like what codespeed provides would still be nice though.
Couldn't a first step be to simply revive http://speed.julialang.org? Was the problem a lack of finances for hardware, or was there significant manual work needed to keep the site up? I think it is easier to get people to write performance tests if they have some pretty graphs/numbers to look at. Getting something decent going is good enough while being on the way to the "performance testing nirvana" @carnaval mentioned.
All great points. My take on each issue:
+1 for porting Criterion, have heard nothing but praise for that library.
Based on @KristofferC's comments here https://github.com/JuliaLang/julia/pull/11683#issuecomment-111875364 I will retract my suggestion of looking into CDash, it looked like it could save some work in just having a place to put large amounts of historical test data, but if it's a Jenkins-style quagmire of configuration files to set up then nevermind. Also looks like they only store 30 days worth of data on free plans.
This jogged my memory, there was a past issue on this at #9456 which was closed without any particular resolution.
I've got about half of the work done to get something inspired by Criterion finished. Should have final results tomorrow night.
Speed.julialang.org was @staticfloat's baby. At the time we didn't have the build infrastructure we have now, so because it was trying to run on every commit and was also dealing with build issues, there were a lot of problems. Many of the perf tests also didn't seem to be set up to run long enough, which created some spurious results. It'd probably be easier to get something working now (using e.g. generic nightly Linux binaries, like PackageEvaluator does).
Would love to write some perf tests too, especially for natural-but-not-highest-possible-perf code (the kind you see on Stackoverflow). Only worth it if we're tracking them.
On the infrastructure side of plumbing up the GitHub webhooks, we should probably borrow some tools from Rust and use https://github.com/barosl/homu to help address https://github.com/staticfloat/julia-buildbot/issues/1. The buildbots we have at http://buildbot.e.ip.saba.us:8010/builders run surprisingly often on master. To start with, we should work out a method of pinging a @juliabot
(looks like that's taken but inactive, maybe @juliabuild?) in the comments of a PR to fire off perf and/or PkgEvaluator runs on-demand.
IIRC @Keno owns @JuliaBot
And I was messing around with this: https://github.com/MechaJulia/MechaJulia for a while
I believe I have @julia-bot with a dash.
MechaJulia's not a bad name, but dash vs no dash would get misspelled way too often.
Realistically buildbot and other projects have already done a lot of work here, there are plenty of off-the-shelf solutions we can start from. The inspired-by-criterion benchmarking harness should probably be written in Julia, but most of the other infrastructure shouldn't IMO.
Totally agree with that perspective
Didn't manage to finish this work tonight, but the core ideas are in place at https://github.com/johnmyleswhite/Benchmarks.jl
I'd like to make it run a lot faster, but it currently does satisfy my core design constraints, including reducing to the behavior of the @time macro when that macro is appropriate. There's still a lot more work to make it nice to use and, more importantly, to produce correct confidence intervals, but I'm hopeful this design should improve upon both the Benchmark.jl and BenchmarkLite.jl packages.
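For readers who haven't seen Criterion, here is a minimal, self-contained sketch of the core idea (this is not Benchmarks.jl's actual API; the function name and defaults are made up): evaluate the expression in geometrically growing batches so each timing sample sits well above the clock's resolution, then estimate the per-evaluation cost from the accumulated (batch size, time) pairs.

```julia
# A minimal sketch of the Criterion-style idea; the function name and defaults
# here are hypothetical, not Benchmarks.jl's actual API.
function sample_benchmark(f; budget_s = 0.5)
    samples = Tuple{Int,Float64}[]
    evals = 1
    t_start = time()
    while time() - t_start < budget_s
        t = @elapsed for _ in 1:evals
            f()
        end
        push!(samples, (evals, t))
        evals = ceil(Int, evals * 1.1)   # grow the batch size geometrically
    end
    # Estimate seconds per evaluation as the least-squares slope (through the
    # origin) of elapsed time against batch size.
    xs = [s[1] for s in samples]
    ys = [s[2] for s in samples]
    return sum(xs .* ys) / sum(xs .* xs)
end

sample_benchmark(() -> sum(rand(1000)))   # ≈ seconds per call
```

The package itself handles far more detail than this, e.g. the confidence intervals mentioned above; the batching is just the part that distinguishes the approach from a bare @time.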
I'm late to the party! Here were/are the challenges with speed.julialang.org:
Yeah, if you're willing to dedicate a machine, that would be awesome. My personal feeling is that doing testing and benchmarking right is one of the best ways that Julia could spend the money that's donated via NumFocus, so I'd like to think we could acquire at least one dedicated machine for Julia as an institution.
I think using codespeed is better than nothing. I guess my question is: do you think their data schema is bad or just their interface? I would be just as happy throwing up a few trivial static graphs that would be easy to generate. The interactive interface doesn't interest me that much.
Let's settle on the metrics we want to report, and then evaluate whether codespeed in its current incarnation will serve the purpose properly.
@johnmyleswhite, I'd definitely be in favor of static graphs, to keep the maintenance load on any one person as light as possible. If someone wants to help out with Codespeed, there is going to be a fixed cost to get oriented that might not be worth it. Just a thought.
Perhaps we can make life easy for ourselves by having a simple webpage with, for example, Gadfly plots embedded that show the general trends of things, and then just have the full data for the last N days downloadable as .jld files or something, in case someone wants to get into the nitty-gritty? The philosophy I'd like to have with this is minimal maintenance, but without sacrificing the ability to investigate what exactly is going on.
The reason I mention Gadfly is because it is really, really nice to be able to hover over the first datapoint that shows a regression and see what commit hash that was.
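A rough sketch of what such a page's plotting script might look like (the column names, data, and file names below are all made up):

```julia
# Hypothetical plotting script for a static trend page.
using Gadfly, DataFrames

df = DataFrame(build   = [1, 2, 3, 4],
               commit  = ["a1b2c3d", "d4e5f6a", "0badf00", "c0ffee1"],
               seconds = [0.41, 0.40, 0.63, 0.62])

p = plot(df, x = :build, y = :seconds, label = :commit,
         Geom.line, Geom.point, Geom.label,
         Guide.xlabel("build"), Guide.ylabel("seconds"))

draw(SVG("perf_trend.svg", 10inch, 4inch), p)   # embed this SVG in the status page
# The raw history could be dumped alongside it, e.g. save("perf_history.jld", "df", df)
# with the JLD package, for anyone who wants the nitty-gritty.
```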
I'm on board with all of that.
That's the http://pkg.julialang.org/pulse.html philosophy! :+1:
I just took a look at how pkg.julialang.org is run. That is kind of hilarious.
Indeed :D For those who don't know, the process is:
1) A cron job launches 4 virtual machines (via Vagrant) on @mlubin's box in our office.
2) Madness happens; JSONs are created.
3) I manually run a script to scp those files to my local machine, bash the JSONs around some more, and enhance them with more info from GitHub. This includes enlarging the test history database (a text file) and updating the badges (one file per Julia version per package).
4) I check everything, run git commit, and push. The website is updated.
5) Optional: I run another script that goes through test status degradations and lets me file an issue if I think it's appropriate.
Can I get more details on step 2, please?
Essentially, runs this script: https://github.com/IainNZ/PackageEvaluator.jl/blob/master/scripts/setup.sh to get the VM set up. Then adds the packages, runs code to detect the license and some other stuff, then attempts to find and run the package's tests.
That last bit is a lot of the madness, and it was from before Pkg.test even existed. I'm actually going to rip that out - things should be a lot simpler, but a decent number of packages will shift from "green" to "blue", because Pkg.test won't find anything.
I like the simplicity of Chapel's simple performance tracking website (http://chapel.sourceforge.net/perf/chap04/). I feel something similar would be more than enough for our needs at this time.
Somewhat on-topic, the choice of mean for measuring relative performance is a pretty interesting topic (to me at least), here is one paper: http://mprc.pku.cn/courses/organization/spring2008/PaperReading/Performance%20and%20Evaluation/Characterizing%20computer%20performance%20with%20a%20single%20number.pdf
I see people comparing relative performance with arithmetic means worryingly often...
I would argue that the arithmetic mean is generally what you want since it's the only number that trivially extrapolates to the long-term cost of evaluating something repeatedly in production. Cf. Radford Neal's disapproval of the use of medians in one R microbenchmarking paper: https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
In an ideal world, you'd want to compare the full probability distributions of timings for stochastic dominance, but that's rarely viable.
FWIW, I think the main problem with relative performance via arithmetic means isn't a problem with arithmetic means per se, but with the lack of application of a proper ratio estimator: https://en.wikipedia.org/wiki/Ratio_estimator
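A toy illustration of the difference (numbers invented): with skewed timings, the naive mean of per-benchmark ratios and a simple ratio-of-totals estimator can tell opposite stories about whether things got faster overall.

```julia
# Made-up per-benchmark times (seconds) under an old and a new build.
old = [1.0, 10.0]
new = [2.0,  5.0]

mean_of_ratios  = sum(new ./ old) / length(old)  # 1.25  -> "25% slower on average"?
ratio_of_totals = sum(new) / sum(old)            # ~0.64 -> total time actually fell ~36%
```

The second number is the one that extrapolates to the long-run cost of a fixed workload mix, which is the sense in which the arithmetic-mean argument above is made.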
Arithmetic mean over samples on a single benchmark, sure. I meant comparing Julia 0.4 vs Julia 0.3 on a suite of metrics. Taking the arithmetic mean for the average speedup of Julia 0.4 over 0.3 on a suite of benchmarks wouldn't be right. That ratio estimator thing is neat; I need to look at that.
I think I'm confused. You're taking arithmetic mean over a suite of benchmarks where speedups are measured in absolute units or percent change?
Like I have 10 benchmarks, and let's just say total time for Julia 0.3 and Julia 0.4 on those benchmarks, totaled over 100 samples. For each benchmark I have the ratio J4 / J3, which (hopefully) measures how much faster Julia 0.4 is on that benchmark. To get a single number for the performance improvement, the arithmetic mean of those ratios would not be a good idea (from my understanding), and can give some non-intuitive results that don't really capture what was intended. There seems to be debate over whether the geometric mean or harmonic mean is more appropriate.
So percentage change, not absolute
Isn't the arithmetic mean over all of the ratios just some scalarized attempt to solve a multi-objective optimization problem -- and hence one that at least guarantees that your induced objective function would always lead to Pareto improvements? I need to read that paper in depth, but I find it hard to imagine a priori that geometric means or harmonic means are a big improvement when the core difficulty is that you can't know what the relative importance of each benchmark is.
That article from Radford Neal is interesting, but it seems to me that the reason he prefers the mean is that the system has a lot of hidden costs (garbage collection) that cause large performance drops on some of the samples, and that needs to be taken into account. I ran into issues benchmarking where looking at the median and MAD (median absolute deviation) worked out better than the arithmetic mean and standard deviation, to avoid counting external slowdowns.
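A minimal example of that robust summary (timings invented; median/mean/std live in the Statistics standard library on current Julia, and in Base on older versions):

```julia
using Statistics

times = [1.02, 0.99, 1.01, 1.00, 7.80, 1.03]  # one sample hit by an external slowdown

med = median(times)                 # 1.015  -- barely notices the outlier
mad = median(abs.(times .- med))    # 0.015  -- robust spread estimate

mean(times), std(times)             # ≈ (2.14, 2.77) -- both dragged up by the 7.8 s sample
```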
I agree with @johnmyleswhite (at least about the parts I understand!): you can't just take a mean of different benchmark percentages or ratios; the number you get is meaningless unless you have different sets of weightings for the results, to match how much those operations are used in different use cases.
@johnmyleswhite that's an interesting perspective... I guess one thing that is clear is that the arithmetic mean isn't inherently wrong without some definition of what wrong (or right) is. I've basically seen and done some playing around myself and found that the arithmetic mean did some fairly unintuitive things when the ratios are not uniformly >= or <= 1, and that the geometric mean was a better way to capture it in a single number - I think that's what these papers are getting at. Would be interesting to formalize it though.
Still more work to be done, but I think the basic design of the Criterion-style benchmarking library is close to finished: https://github.com/johnmyleswhite/Benchmarks.jl/
@IainNZ I think you really want to use (possibly weighted) geometric means when you're working with ratios. If on one machine 0.4 ran twice as fast as 0.3, and on another one twice as slow, the mean should be sqrt(1/2 * 2) = 1, not (1/2 + 2)/2 = 1.25.
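Spelled out as a quick check (self-contained toy numbers):

```julia
# Speedup ratios (0.4 time / 0.3 time): twice as fast on one machine, twice as slow on the other.
ratios = [0.5, 2.0]

arithmetic = sum(ratios) / length(ratios)        # 1.25 -- suggests a net slowdown
geometric  = prod(ratios)^(1 / length(ratios))   # 1.0  -- no net change, as argued above
```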
To back up what @nalimilan said, search for the paper "How not to lie with statistics: the correct way to summarize benchmark results".
This being the internet, we can also just link to things: Fleming1986 is the paper in question.
Just to drive you all a bit crazy :grinning:, have you thought of multiprocess benchmarking? Almost all of my benchmarking experience was in benchmarking operations using all of the cores, because single-process benchmarking doesn't give a good indicator of system-wide performance. Let's say you have algorithm A, which is 2x as fast as algorithm B when run on a single process, but it uses 64K of temp array space, whereas algorithm B requires only 2K... are you really sure that A is going to do better when you have 1000 processes on a 32-core machine? Usage of your L1/L2 and even L3 caches becomes critical... cache-line ping-ponging can also make or break scalability...
Ahh @GlenHertz @pao, that was the paper I was thinking of!
@nalimilan yes that was what I was saying.
FWIW, here's how I think one should handle these kinds of problems: http://eytan.github.io/benchmarking.pdf
At least in operations research (surprised Iain hasn't mentioned this yet) the standard presentation format for benchmarks over large test sets with a variety of large and small problem instances is Dolan-More performance profiles: http://arxiv.org/pdf/cs/0102001.pdf
That format of presenting results is much more valuable when you're comparing more than 2 alternatives though.
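For anyone who hasn't run into them, a rough sketch (toy timing matrix) of how the profile curves are computed: each solver's curve rho_s(tau) is the fraction of problems it solves within a factor tau of the best time on that problem.

```julia
# times[p, s]: time for solver/build s on problem p (made-up numbers).
times = [ 1.0  1.2;
          3.0  2.0;
         10.0  4.0]

best = minimum(times, dims = 2)        # best time per problem
r    = times ./ best                   # performance ratios r[p, s] >= 1

# rho(s, tau): fraction of problems where solver s is within a factor tau of the best.
rho(s, tau) = count(r[:, s] .<= tau) / size(r, 1)

[rho(s, tau) for tau in (1.0, 1.5, 2.0, 2.5), s in 1:2]  # one column per solver
```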
Oh man how did I forget that! Seriously, performance profiles are some interesting stuff once you wrap your head around them.
Just to add to the confusion: Citron et al., in their The harmonic or geometric mean: does it really matter?, have an interesting historical timeline of “The War of the Means,” which seems to have been going on for a while in the hardware community. I also kinda like Bast and Weber's Don't Compare Averages.
I think Eugster, Hothorn and Leisch have some interesting solutions in their Exploratory and Inferential Analysis of Benchmark Experiments, which they have implemented in the R package benchmark. Many of these solutions avoid the use of averages by summarizing all the data in other ways, e.g., histograms of the occurrence of each of the compared methods at each ranking, from first to last, or by constructing a consensus ranking across the various benchmarks (although that raises the question of how to weight them in constructing the consensus, I guess).
BTW: Kudos & thanks to @johnmyleswhite for Benchmarks.jl! Always nice to cross stuff out on my wishlist :smile_cat: :gift:
see #13893
To repeat what others have said, you've done a really impressive job putting this together @jrevels. :applause:
Thanks!
As discussed in #11683, it would be great to revive some kind of performance regression testing so that we don't slowly accumulate perf problems.
In an ideal world, we'd standardize on a system that non-Base packages could use as well. But for now it would be good to just get Base being tested for perf rigorously again.