CI Performance Tracking for v0.5

jrevels commented 8 years ago

As progress moves forward on v0.5 development (especially #13157), we'll need an automated system for executing benchmarks and identifying performance regressions.

The Julia group has recently purchased dedicated performance testing hardware which is hosted at CSAIL, and I've been brought on to facilitate the development of a system that takes advantage of this hardware. I'm hoping we can use this issue to centralize discussion/development efforts.

Desired features

Any implementation of a performance tracking system should:

track multiple metrics (e.g. time, % GC time, bytes allocated)
allow collaborators to trigger the following via GitHub's UI:
- benchmark execution on a specific commit
- benchmark result comparison between two specific commits
incorporate a tagging system so that groups of benchmarks can be selectively run by topic (e.g. run only parallel benchmarks, or only string benchmarks)
store benchmark result history for future analysis
report benchmark status/results through GitHub's UI
be applicable to normal Julia packages, not just Base (envision a future where PackageEvaluator benchmarks registered packages)
preferably be written in native Julia

Feel free to chime in with additional feature goals - the ones listed above just outline what I've been focusing on so far.

Existing work

In order to make progress on this issue, I've been working on JuliaCI/BenchmarkTrackers.jl, which supplies a unified framework for writing, executing, and tracking benchmarks. It's still very much in development, but currently supports all of the goals listed above. I encourage you to check it out and open up issues/PRs in that repository if you have ideas or concerns. Just don't expect the package to be stable yet - I'm in the process of making some drastic changes (mainly to improve the testability of the code).

Here are some other packages that any interested parties will want to be familiar with:

johnmyleswhite/Benchmarks.jl: a framework for benchmark execution, with the goal of delivering more rigorous/useful metrics than those provided by simple calls to @time.
JuliaWeb/GitHub.jl: will soon contain the GitHub event-handling infrastructure that I've been working on, which allows stuff like comment-based triggers and GitHub status reporting.
IainNZ/BasePerfTests.jl: a collection of performance tests (I'm not sure that this is still being developed).
staticfloat/Perftests.jl: another collection of performance tests + a grouping/logging framework for ease of execution. I believe @staticfloat has been running these on occasion via the buildbots.

Eventually we will want to consolidate "blessed" benchmarking/CI packages under an umbrella Julia group on Github (maybe JuliaCI)?

I saw that Codespeed was used for a while, but that effort was abandoned due to the burden of maintaining the server through volunteer effort. I've also been told that Codespeed didn't integrate well with the Github-centric CI workflow that we've become accustomed to.

Resolving this issue

Taking into account the capabilities previously mentioned, I imagine that a single CI benchmark cycle for Base would go through these steps:

Collaborator comments on a commit with the appropriate trigger phrase
A comment-listening server (the meat of which I've implemented here) triggers ~~the buildbots~~ a CI server to build Julia at that commit
~~The buildbots call out to~~ BenchmarkTrackers is used to process an external package of performance tests (similar to Perftests, but written using BenchmarkTrackers)
BenchmarkTrackers reports the results via GitHub statuses to the commit. Any additional reporting can be done with the @nanosoldier bot ("nanosolider" is the hostname of the hardware at CSAIL, and @jakebolewski managed to grab the GitHub account name).

If we can deliver the workflow described above, I think that will be sufficient to declare this issue "resolved."

Next steps

The following still needs to get done:

[x] Configure a webhook on the JuliaLang/julia repo for firing off events so that we can begin testing the necessary CI infrastructure
[x] Get BenchmarkTrackers.jl to a release-ready state, which means more robust testing and further development on both itself and its dependencies, GitHub.jl and Benchmarks.jl
[x] Start organizing/rewriting existing performance tests using BenchmarkTrackers.jl
[x] Set up the @nanosoldier to do any additional status reporting we want (e.g. linking to a commit status page after somebody triggers benchmark execution - GitHub doesn't do a good job of displaying statuses for commits outside of PRs).
Regression Examples

Here are some examples from regression-prone areas that I think could be more easily moderated with an automated performance tracking system:

parallel (#12794, #12223)
inlining/type inference (#13551, #13350, #12476)
"real-world" use cases (#11823, #7000, #6415)
individual functions (#11700, #10650, #8100, #6112)
codegen/vectorization (#13777, #13301, #11899, #11997)

musm commented 8 years ago

Will this include tracking performance on all three major operating systems (mac, linux distro,windows)? To identify possible regressions affecting only one system?

jakebolewski commented 8 years ago

No, performance tracking will be limited to running on Ubuntu Linux (similar to Travis). We don't have the manpower / resources to do cross platform performance testing at least initially.

Part of the goal is to make all this testing infrastructure lightweight / modular enough that it can be run on a user's computer with minimal effort (another reason to use Julia and not something like CodeSpeed). This way volunteer's (or organizations) could plug gaps in systems we don't support through automated CI while using the same benchmarking stack.

xianyi commented 8 years ago

+1 I think I can use this infrastructure to track OpenBLAS performance, too.

jrevels commented 8 years ago

Will this include tracking performance on all three major operating systems (mac, linux distro,windows)?

In the future, if we had something like a periodic (e.g. weekly) benchmark pass separate from the "on-demand" CI cycle being discussing in this issue, we might consider doing a full OS sweep on occasion (not per-commit, though).

But as @jakebolewski pointed out, we're not really concerned with cross-platform tracking at the moment, especially given our current resource limitations.

StefanKarpinski commented 8 years ago

Let's get it working on Linux. Running it regularly on other OSes can be a later goal.

tkelman commented 8 years ago

We can turn the webhook on now if you know how it'll need to be configured yet.

jrevels commented 8 years ago

I'm not sure yet what the payload URL is going to be, but I'll keep you posted with the details once we figure it out.

staticfloat commented 8 years ago

Hey guys, sorry to be MIA for the last week or two. @jrevels asked me in private a little while ago to write up my plan about performance testing that I am 30% through enacting, and so I am going to data-dump it here so that everyone can see it, critique it, and help shape it moving forward to make something equally usable by all.

I'm personally not so concerned with the testing methodology, statistical significance etc... of our benchmarking. We have much more qualified minds to duke that out, what I'm interested in is the infrastructure; how do we make this easy to setup, easy to maintain, and easy to use. Here's my wishlist/design doc for performance eval of Base; this is completely separate from package performance tracking which is of lower priority IMO.

Runs automatically on whitelisted branches, but can be manually run by contributors

Right now, 90% of what I do with Julia revolves around creative abuse of buildbot. I have a system setup where my Perftests.jl package gets run on every commit of master that makes it through the test suite. That package dumps results to .csv files, a sample of the last month of those runs is available as a tarball here, for the time being (Warning: ~700MB large because it contains the data from every single sample taken during a Benchmarks.jl run). It's not unthinkable that we could have a [perftest] tag in a git commit message (that was committed by a contributor) to trigger a perf test, or some other kind of github integration. (I don't know enough about Github integration to know what the best use case here is; if we want to run/rerun a perftest after the fact, can we make a button to do so? Is the way to do it to run it manually via buildbot, and then update a github status somewhere?)

Is independent of Base and runs on every version that we care about

Right now, that includes 0.4 and 0.5, but could possibly include 0.3. Obviously, there will be tests that we don't want to run on older versions, or even tests that we will drop as APIs change. But having our test infrastructure independent of Base (unlike the old test/perf/ directory) will make it a lot cleaner to maintain, and especially easier to fix problems in our test infrastructure without messing around with Base.

This is why I made the Perftests.jl package, but real life caught up to me faster than I thought, and so that repository, while functional, is missing some Sparse tests that the old test/perf/ directory had. Other than that, it's what I would consider "functional".

Is stored in a location easy for anyone and anything to access

I like making pretty visualizations. But I am nothing compared to what the rest of the Julia community is capable of, and I'd really like to make getting at and visualizing our data as easy as possible. To me, that means storing the data in something robust, public-facing, and easily queried. For our use cases, I think InfluxDB is a reasonably good choice, as I don't think reinventing the database wheel is a good use of our time, and it provides nice, standard ways of getting at the data.

In my nanosoldier/Perftest.jl world, my next step would be to write a set of Julia scripts that parse the .csv files generated by Perftests.jl, distill them down into the metrics that we would want to keep around forever (which is a much smaller set of data than what is generated in the .csv files right now) and upload them to an InfluxDB instance. I've got the uploading part done, I just haven't made a way to parse the .csv files and get the relevant bits out.

That server, being designed for timeseries and publicly available, would likely function better than anything we would cobble together ourselves, and would open the path to writing our own visualization software (a la codespeed) to even using something that someone else has already written (Kibana, Grafana, etc...).

That's been my plan, and I'm partway toward it, but there are some holes, and I'm not married to my ideas, so if others have alternative plans I'd be happy to hear them and see how we can most efficiently move from where we are today, to where we want to be. I am under no illusions that I will be able to put a significant amount of work toward any proposal, so it's best if the discussion that comes out of this behemoth of a github post is centered around what others want to do, rather than what I want to do. Either that, or we just have patience until I can get around to this.

tkelman commented 8 years ago

Sounds about right. I'd prefer a tagged comment listener hook (@nanosoldier perftest foo) rather than having to put specific things in the commit message.

hayd commented 8 years ago

Another option might be a "run perftests" github label. Edit: ah, but you wouldn't be able to specify foo.

jrevels commented 8 years ago

Thanks for the write-up, @staticfloat. I've definitely been keeping in mind the things we've discussed when working on BenchmarkTrackers. I'd love it if you could check out the package when you have the time.

It's not unthinkable that we could have a [perftest] tag in a git commit message (that was committed by a contributor) to trigger a perf test, or some other kind of github integration.

I've been advocating the trigger-via-comment strategy because I think it will encourage more explicitly targeted benchmark cycles that will make better use of our hardware compared to a trigger-via-push strategy. One is still be able to trigger per-commit runs by commenting on the commit with the appropriate trigger phrase, and that way you don't have to clutter up your commit messages with benchmark-related jargon.

In my nanosoldier/Perftest.jl world, my next step would be to write a set of Julia scripts that parse the .csv files generated by Perftests.jl, distill them down into the metrics that we would want to keep around forever (which is a much smaller set of data than what is generated in the .csv files right now) and upload them to an InfluxDB instance. I've got the uploading part done, I just haven't made a way to parse the .csv files and get the relevant bits out.

The logging component that BenchmarkTrackers uses for history management is designed to be swappable so that we can support third-party databases in the future. It currently only supports JSON and JLD serialization/deserialization, but there's nothing stopping us from extending that once we get the basic CI cycle going.

StefanKarpinski commented 8 years ago

I've been advocating the trigger-via-comment strategy because I think it will encourage more explicitly targeted benchmark cycles that will make better use of our hardware compared to a trigger-via-push strategy. One is still be able to trigger per-commit runs by commenting on the commit with the appropriate trigger phrase, and that way you don't have to clutter up your commit messages with benchmark-related jargon.

Yes, please. The whole push-triggered model is so broken. Just because I pushed something doesn't mean I want to test it or benchmark it. And if I do, posting a comment is not exactly hard. I do think that we should complement comment-triggered CI and benchmarking with periodic tests on master and each release branch.

jakebolewski commented 8 years ago

I added a benchmark tag so we can tag performance related PR's that need objective benchmarks.

IainNZ commented 8 years ago

@jrevels I don't have the bandwidth to meaningfully contribute to this, but my BasePerfTests.jl package serves a similar purpose to @staticfloat's, in that it was a thought experiment for disconnecting performance tests from the Julia version, and what a culture of adding a performance regression test for performance issues would look like (analogous to adding regression tests for bugs)

jrevels commented 8 years ago

CI performance tracking is now enabled! There are still some rough edges to work out, and features that could be added, but I've been testing the system on my Julia fork for a couple of weeks now and it's been stable. Here's some info on how to use this new system.

The Benchmark Suite

The CI benchmark suite is located in the BaseBenchmarks.jl package. These benchmarks are written and tagged using BenchmarkTrackers.jl. Currently, I've only populated the suite with a couple of the array indexing benchmarks that are already in Base. Those benchmarks were enough to test the CI server, but we'll obviously want more variety as soon as possible. The suite now contains all the benchmarks currently in test/perf. Filling out the suite will be my priority for the next few weeks. As always, PRs are welcome!

Triggering Jobs

Benchmark jobs are submitted to MIT's hardware by commenting in pull requests or on commits. Only repository collaborators can submit jobs. To submit a job, post a comment containing the trigger phrase runbenchmarks(tag_predicate, vs = ref). The proper syntax for this trigger phrase can be demonstrated by a few examples (each quoted block is an example of a comment):

I want to run benchmarks tagged "array" on the current commit.

`runbenchmarks("array")`

If this comment is on a specific commit, benchmarks will run on that commit. If
it's in a PR, they will run on the head/merge commit. If it's on a diff, they will
run on the commit associated with the diff.

I want to run benchmarks tagged "array" on the current commit, and compare the
results with those of commit 858dee2b09d6a01cb5a2e4fb2444dd6bed469b7f.

`runbenchmarks("array", vs = "858dee2b09d6a01cb5a2e4fb2444dd6bed469b7f")`

I want to run benchmarks tagged "array", but not "simd" or "linalg", on the
current commit. I want to compare the results against those of the release-0.4
branch.

`runbenchmarks("array" && !("simd" || "linalg"), vs = "JuliaLang/julia:release-0.4")`

I could've compared against a fork by specifying a different repository (e.g.
replace "JuliaLang/julia:release-0.4" with "jrevels/julia:release-0.4").

I want to run benchmarks tagged "array", but not "simd" or "linalg", on the
current commit. I want to compare the results against a fork's commit.

`runbenchmarks("array" && !("simd" || "linalg"), vs = "jrevels/julia@c70ab26bb677c92f0d8e0ae41c3035217a4b111f")`

The allowable syntax for the tag predicate matches the syntax accepted by BenchmarkTrackers.@tagged.

Examining Results

The CI server communicates back to the GitHub UI by posting statuses to the commit on which a job was triggered (similarly to Travis). Here are the states a commit status might take:

Pending: The job is queued up. Another pending status is posted once the job begins running.
Failure: The job completed, and performance regressions were found.
Success: The job completed, and no performance regressions were found.
Error: The job could not be completed. User-facing reasons for this include build errors and malformed trigger phrase syntax.

Failure and success statuses will include a link back to a report stored in the BaseBenchmarkReports repository. The reports are formatted in markdown and look like this. That's from a job I ran on my fork, which compared the master branch against the release-0.4 branch (I haven't trawled through the regressions caught there yet).

Note that GitHub doesn't do a very good job of displaying commit statuses outside of PRs. If you want to check the statuses of a commit directly, I usually use GitHub.jl's statuses method, or you could go to the commit's status page via your browser at api.github.com/repos/JuliaLang/julia/statuses/:sha.

Rough Edges/Usage Tips

Only one job submission can be made per comment (e.g. only the first use of runbenchmarks(...) is picked up).
If your trigger phrase syntax is malformed, you may not get a status reply at all, since the CI server won't pick up the comment.
I'm still working on measuring and reducing the noise/variance on our hardware. The regression determination method might be improved (or at least more finely tuned) once we have more data.
Storing/reporting results in a GitHub repository is fine initially, but eventually we'll want a proper database for long-term storage (and probably a better front end).
Obviously, you shouldn't trigger jobs on commits that don't build successfully on Travis.
Jobs triggered on commits/branches that introduce changes which break BaseBenchmarks.jl will necessarily result in error (I'm going to try to keep that package as forward-compatible as possible).

Finally, I'd like for anybody who triggers a build in the next couple of days to CC me when you do so, just so that I can keep track of how everything is going server-side and handle any bugs that may arise.

P.S. @shashi @ViralBShah @amitmurthy and anybody else who uses Nanosoldier: nanosoldier5, nanosoldier6, nanosoldier7, and nanosoldier8 should now be reserved exclusively for CI performance testing.

tkelman commented 8 years ago

Sounds great. How easy will it be to associate reports in https://github.com/JuliaCI/BaseBenchmarkReports to the commit/pr they came from? Posting a nanosoldier response comment (maybe one per thread with edits for adding future runs?) might be easier to access than statuses, though noisier.

jrevels commented 8 years ago

The reports link back to the triggering comment for the associated job, and also provide links to the relevant commits for the job.

Going the other way, clicking on a status's "Details" link takes you to the report page (just like clicking on the "Details" link for a Travis status takes you to a Travis CI page). That only works in PRs, though.

I'm onboard for getting @nanosoldier to post automated replies on commit comments (that's the last checkbox in this issue's description). I'm going to be messing around with that in the near future.

timholy commented 8 years ago

Exciting stuff, @jrevels!

staticfloat commented 8 years ago

This is really cool @jrevels, so glad you've taken this up.

jrevels commented 8 years ago

I just updated the CI tracking service to incorporate some recent changes regarding report readability and regression detection. I've started tagging BenchmarkTrackers.jl such that the latest tagged version corresponds to the currently deployed version.

The update also incorporates the recent LAPACK and BLAS additions to BaseBenchmarks.jl (they're basically the same as the corresponding benchmarks in Base). Similar to BenchmarkTrackers.jl, I've started tagging the repo so that it's easy to see what versions are currently deployed.

I depluralized the existing tags (e.g. "arrays" --> "array"), as I'm going to try to consistently make them singular in the future. Additionally, one can now use the keyword ALL to run all benchmarks (e.g. runbenchmarks(ALL, vs = "JuliaLang/julia:master")).

jrevels commented 8 years ago

Responding here to discussion in #14623.

I think I would rather see packages add a benchmarks/ directory and have the benchmark infrastructure be able to pull those all in.

This is exactly the intent of BenchmarkTrackers. Recently, my main focus has been setting up infrastructure for Base, but the end-goal is to have package benchmarks be runnable as part of PackageEvaluator. If they want to get a head start on things, package authors can begin using BenchmarkTrackers to write benchmarks for their own package.

Yeah, depends on how much control you want over the benchmarks. If we go that route it feels that it should be included in Base somehow á la Pkg.benchmark().

After we use the existing infrastructure for a while, we could consider folding some unified version of the benchmarking stack (BaseBenchmarks.jl + Benchmarks.jl + BenchmarkTrackers.jl) into Base, and have a Base.Benchmarks module that formalizes this stuff across the board for the language. The two reasons why that would be useful:

Benchmark breakage as a result of language changes could be resolved within the PRs causing the breakage. Probably the biggest flaw with the current system is that CI tracking won't work on a PR that breaks BaseBenchmarks.jl.
Automatically creating a benchmarks directory in new packages that could be run with Pkg.benchmark() might help establish performance tracking as a norm for Julia packages.

tkelman commented 8 years ago

That's a lot of code to bring into base, and the wrong direction w.r.t. #5155. I think we can add more automation and levels of testing that aren't exactly Base or its own tests or CI, but would run and flag breakages frequently.

hayd commented 8 years ago

So it would be Benchmarks.pkg("Foo") would look for benchmarks directory in package and run etc. ?

KristofferC commented 8 years ago

It could be put in PkgDev maybe.

My thought here was that the work needed to create a new benchmark repo (BenchmarkPackages.jl if you will) and throw some benchmarks in there from the more well maintained and high quality packages is not that much work. We could do some initial benchmarks between 0.4.2 vs master vs llvm 3.7 and get, at least, an overview of full package performance quite fast.

I'm just being pragmatic here about the time to implement everything and the time to get some results.

@hayd Pkg.benchmark("Foo")?

jrevels commented 8 years ago

That's a lot of code to bring into base, and the wrong direction w.r.t. #5155. I think we can add more automation and levels of testing that aren't exactly Base or its own tests or CI, but would run and flag breakages frequently.

True. It's also too early to tell what will be useful, IMO - BaseBenchmarks.jl needs to be fleshed out quite a bit and the infrastructure needs to settle in.

My thought here was that the work needed to create a new benchmark repo (BenchmarkPackages.jl if you will) and throw some benchmarks in there from the more well maintained and high quality packages is not that much work. We could do some initial benchmarks between 0.4.2 vs master vs llvm 3.7 and get, at least, an overview of full package performance quite fast.

This could be useful in the short term, if you can convince package authors to contribute to it. Long term, I think we should extend PackageEvaluator to run each package's localized benchmarks (though that isn't worth doing until BenchmarkTrackers.jl and Benchmarks.jl are ready for METADATA).

jrevels commented 8 years ago

PSA: ~~The CI server will be down for a few a little while I update it; I'll edit this comment when it's back up.~~ It's back up, with the update.

The update brings better formatted reports - the results table is now sorted by ID and only significant changes are displayed (you can still look at the raw data if you want more info, and a list of all executed benchmarks is given at the end of the report). To view the report in the browser without having to scroll horizontally, you can install the Wide GitHub Chrome Extension (shout out to @KristofferC for pointing this out, it makes all of GitHub look better IMO). Here's a preview of the new format (complete with some actual improvements/regressions).

More importantly, this update includes all* the benchmarks you know and love from julia/test/perf, rewritten using the new framework in BaseBenchmarks.jl. Total turnaround time for a job running all the benchmarks on two different versions of Julia seems to be about 2-3 hours. If you're eager for faster results, you can use the tag predicate system previously described, otherwise the ALL keyword can be used as usual.

*~~Well, most. I didn't include the 3D Lattice Boltzmann benchmark since I saw that it relied on non-Base code.~~ It's been added.

jrevels commented 8 years ago

The CI tracker has been updated once again. In addition to a slew of new benchmarks, the new version of the tracker causes @nanosoldier to comment in a PR (or on a commit) once a job has completed. This comment contains a status message, a link back to the original job-triggering comment, and a link to the job's markdown report. It also cc's me, so I can easily keep tabs on the server as jobs are completed.

While it was suggested that multiple jobs from the same PR/commit result in comment edits rather than entirely new comments, I ended up going the other way - one comment per job, even if there are multiple jobs in the same context. This was way simpler to implement, and allows us to get email updates when jobs complete.

Now that every checkbox in this issue has been ticked, I'm going to consider it resolved.

Of course, I'm going to keep adding benchmarks to BaseBenchmarks.jl as they crop up (I still have some on my list right now). If necessary, I'll notify the community of drastic changes to the CI tracker by posting an update here.

Any bug reports, suggestions, or improvements regarding CI benchmarks/infrastructure should be filed as issues/PRs at BaseBenchmarks.jl or BenchmarkTrackers.jl.

Thanks for all the help/discussion along the way!

jiahao commented 8 years ago

:cake: :fireworks: :dancers: :dart: :100:

ViralBShah commented 8 years ago

Fantastic work!

I also wanted to ask if this infrastructure could potentially be adopted to run PkgEvaluator as well. If so, perhaps we can have a different issue.

tkelman commented 8 years ago

PackageEvaluator now lives at https://github.com/JuliaCI/PackageEvaluator.jl, a little bit of refactoring would be needed there to accept commits to test against programmatically. I've been patching that manually for my own runs but shouldn't be too bad to make more flexible.

jrevels commented 8 years ago

It wouldn't be out of the question to separate the infrastructure from the benchmark-specific stuff in BenchmarkTrackers.jl, and put it in a "Nanosoldier.jl" package that could be used to handle multiple kinds of requests delivered via comment. That way all job submissions to @nanosoldier could easily share the same scheduler/job queue.

tkelman commented 8 years ago

That might mean less new semi-duplicated code to write. You apparently already had BenchmarkTrackers set up to be able to use the same nanosoldier node I've been using manually, right?

jrevels commented 8 years ago

For testing new versions of the package, yeah. The master/slave nodes it uses are easily configurable.

jrevels commented 8 years ago

I'll be doing a ForwardDiff.jl sprint next week, but after that I'd be down to work on this - I pretty much know how to do it on the CI side of things. The challenging part might be learning how PackageEvaluator works under the hood, but that doesn't seem like it will be overly difficult.

tkelman commented 8 years ago

I'll help on that side since I'll want to use this right away.

IainNZ commented 8 years ago

I'll also help in terms of providing advice drawn on any lessons I've learnt

hayd commented 8 years ago

@jrevels Does runbenchmarks against a branch (e.g. master) run against the merge-base (e.g. of master and the current commit) or the tip of master?

jrevels commented 8 years ago

If the job is triggered in a PR, benchmarks will run on that PR's merge commit (i.e. the result of the head commit of the PR merged into the PR's base). If there's a merge conflict, and the merge commit doesn't exist, then the head commit of the PR is used instead.

Comparison builds (specified by the vs keyword argument) are always exactly what you specify; either the commit of the given SHA, or the head commit of the given branch.

tkelman commented 8 years ago

We really need to be running this against master on a regular schedule and saving the results somewhere visible. Only getting a comparison when you specifically request one is not a very reliable way to track regressions.

jrevels commented 8 years ago

It's definitely been the plan for the on-demand benchmarking service to be supplemented with data taken at regularly scheduled intervals.

Armed with the data we have from running this system for a while, I've been busy rewriting the execution process to deliver more reliable results, and that work is close to completion (I'm at the fine-tuning and doc-writing phase of development).

After switching over to this new backend, the next step in the benchmarking saga will be to end our hacky usage of GitHub as the public interface to the data and set up an actual database instance, as @staticfloat originally suggested. We can then set up a cron job that benchmarks against master every other day or so and dumps the results to the database.

tkelman commented 8 years ago

Ref #16128, I'm reopening until this runs on an automated schedule.

jrevels commented 8 years ago

An update: @nanosoldier will be down for a day or two while I reconfigure our cluster hardware.

When it comes back up, the CI benchmarking service will utilize the new BenchmarkTools.jl + Nanosoldier.jl stack I've been working on for the past couple of months. The BenchmarkTools package is a replacement for the Benchmarks + BenchmarkTrackers stack, while the Nanosoldier package provides an abstract job submission framework that we can use to add features to our CI bot (e.g. we can build the "run pkgeval by commenting" feature on top of this).

A practical note for collaborators: Moving forward, you'll have to explicitly at-mention @nanosoldier before the trigger phrase when submitting a job. For example, instead of your comment containing this:

`runbenchmarks(tag_predicate, vs = "ref")`

...you'll need this:

@nanosoldier `runbenchmarks(tag_predicate, vs = "ref")`

More @nanosoldier documentation can be found in the Nanosoldier.jl repo.

tkelman commented 8 years ago

What needs to be done to get this running nightly and putting up a report somewhere people can see it?

jrevels commented 8 years ago

The easiest thing to do would just be to set up a cron job that causes @nanosoldier to submit CI jobs to itself on a daily basis. My work during the week has to be devoted to paper-writing at the moment, but I can try to set something up this weekend.

jrevels commented 8 years ago

Starting today, @nanosoldier will automatically execute benchmarks on master on a daily basis. The generated report compares the current day's results with the previous day's results. All the raw data (formatted as JLD) is compressed and uploaded along with the report, so you can easily clone the report repository and use BenchmarkTools to compare any day's results with any other day's results.

jrevels commented 8 years ago

The first daily comparison against a previous day's build has executed sucessfully, so I'm going to consider this issue resolved.

There is definitely still work to be done here - switching over to a real database instead of abusing git, adding more benchmarks to BaseBenchmarks.jl, and making a site that visualizes the benchmark data in a more discoverable way are things that I'd love to see happen eventually. Any subsequent issues - errors, improvements, etc. - can simply be raised in the appropriate project repositories in JuliaCI. As before, we can still use this thread for PSAs to the wider community when user-facing changes are made.

KristofferC commented 8 years ago

Does it make sense to run against previous release as well. I'm thinking else, regressions could be introduced bit by bit were each part is small enough to dissappear in the noise.

tkelman commented 8 years ago

That does make sense to me, though if we're collecting absolute numbers and expect the hardware to remain consistent, probably no need to run the exact same benchmarks against the exact same release version of Julia every day? Maybe re-run release absolute numbers a little less often, once or a handful of times per week?

jrevels commented 8 years ago

Let's continue this discussion in JuliaCI/Nanosoldier.jl#5.

JuliaLang / julia