CI for benchmarks online

lukego commented 7 years ago

This repo is cool! I am really happy to have a test suite. This seems great for people who want to maintain their own branches and keep track of how they compare with everybody else's. Like, have I broken something? Have my optimizations worked? Has somebody else made some optimizations that I should merge? etc. Just now I would like to maintain a branch called lowlevel to soak up things like intrinsics and DynASM Lua-mode so this is right on target for me.

I whipped up a Continuous Integration job to help. The CI downloads the latest code for some well-known branches, runs the benchmark suite 100 times for each branch, and reports the results. This updates automatically when any of the branches change (including the benchmark definitions).

The reason I run the benchmarks 100 times is to support tests that use randomness to exercise non-determinism in the JIT, like roulette (#9). Repeated tests mean that we can quantify how consistent the benchmark results are between runs, and once we have a metric for consistency then it is more straightforward to optimize (see luajit/luajit#218).
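As an illustration of the consistency metric mentioned above, here is a minimal sketch of a relative standard deviation calculation over repeated runs. The function name and the sample timings are hypothetical and are not taken from the actual CI report.

```python
import statistics

def relative_std_dev(samples):
    """Return the sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(samples)
    return 100.0 * statistics.stdev(samples) / mean

# Hypothetical wall-clock times (seconds) for repeated runs of one benchmark.
steady = [2.00, 2.02, 1.98, 2.01, 1.99]
erratic = [2.00, 2.02, 4.10, 2.01, 3.95]

print(f"steady:  {relative_std_dev(steady):.1f}%")
print(f"erratic: {relative_std_dev(erratic):.1f}%")
```

A branch whose runs sometimes land on a much slower trace would show up with a visibly larger percentage, even if its best-case time is unchanged.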

The branches I am testing now are master, v2.1, agentzh-v2.1, corsix/x64, and lukego/lowlevel. If anybody would like a branch added (or removed) just drop me a comment here. Currently the benchmark definitions are coming from my fork because I wanted to include roulette to check that variation is measured correctly.

Screenshot of the first graph:

[image: benchmarks]

and links:

Current results at the time of writing.
Permalink to the latest results.
CI Jobset page where all builds can be found (and also related files like the raw CSV).
Definition of the test runner written in Nix + shell.
Rmarkdown source for the report.

Hope somebody else finds this useful, too! Feedback & pull requests welcome. I plan to keep this operational.

corsix commented 7 years ago

corsix/x64 was effectively merged into v2.1, so I don't expect to be making any more commits to it. corsix/newgc on the other hand...

lukego commented 7 years ago

@corsix Roger. I updated the config to test newgc instead of x64. The results will automatically go up on the permalink above.

lukego commented 7 years ago

Is it hopelessly naive to simply run the benchmarks by evaluating them with no arguments? https://github.com/lukego/LuaJIT-branch-tests/blob/5043523d6cb59d35e7ecf5ee51f2253ab75d8675/default.nix#L57. I suppose that I should at least save the output to check if they are really working. Some execute very quickly.

@corsix do you need any special build options for newgc?

MikePall commented 7 years ago

@lukego Maybe you missed those bench/PARAM* files that contain the N arguments to each benchmark? Scale as appropriate to give a run time of a couple seconds each. No point in running these more than a dozen times.
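A rough sketch of how a runner might pick up those arguments. It assumes each line of a PARAM file is a benchmark name followed by its arguments, separated by whitespace; that layout, the `#` comment handling, the sample values, and the helper names are all assumptions for illustration, so check the real files before relying on it.

```python
def parse_params(text):
    """Map benchmark name -> list of argument strings.

    Assumed format: one benchmark per line, "name arg1 arg2 ...".
    Blank lines and lines starting with '#' are skipped.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, *args = line.split()
        params[name] = args
    return params

def command_for(name, args, luajit="luajit"):
    """Build the argv list for running one benchmark with its arguments."""
    return [luajit, f"{name}.lua", *args]

# Hypothetical file contents; the N values are made up.
sample = """
binary-trees 16
mandelbrot 5000
partialsums 1e7
"""

for name, args in parse_params(sample).items():
    print(" ".join(command_for(name, args)))
```

Arguments are kept as strings so that values written like `1e7` are passed through to the benchmark unchanged.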

Consider verifying the checksum of the benchmark output against known good checksums for each N. E.g. generated with plain Lua or the C equivalents of the tests (you really need this for larger N).
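One way this check could look, as a sketch only: the reference table is keyed by benchmark name and N, and SHA-256 is an arbitrary choice here since no particular checksum is specified above. The helper names and the sample output bytes are hypothetical.

```python
import hashlib

def output_checksum(output: bytes) -> str:
    """Hex digest of a benchmark's captured stdout."""
    return hashlib.sha256(output).hexdigest()

def verify(name, n, output, known_good):
    """Return True if the output matches the reference checksum for (name, n)."""
    expected = known_good.get((name, n))
    if expected is None:
        raise KeyError(f"no reference checksum for {name} with N={n}")
    return output_checksum(output) == expected

# Hypothetical reference output, e.g. captured once from plain Lua or a C equivalent.
reference_output = b"3.000000000\t(2/3)^k\n"
known_good = {("partialsums", "1000"): output_checksum(reference_output)}

print(verify("partialsums", "1000", reference_output, known_good))
print(verify("partialsums", "1000", b"garbage\n", known_good))
```

Keying on N matters because a checksum recorded for one problem size says nothing about the output at another.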

Note that mandelbrot suffers from numerical instability and may give different results, depending on fused vs. unfused FP arithmetic on some platforms (JIT-compiled code, i.e. fused, is actually more accurate). And partialsums depends on the accuracy of a couple of math library functions, which isn't very good on some platforms.
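To make the fused vs. unfused point concrete, here is a small sketch that emulates a fused multiply-add by computing `a*b + c` exactly with rationals and rounding once, and compares it with the ordinary two-rounding expression. The function names and input values are chosen for illustration; this is not how LuaJIT emits the operation.

```python
from fractions import Fraction

def unfused_mul_add(a, b, c):
    """a*b is rounded to a double, then the sum is rounded again."""
    return a * b + c

def fused_mul_add(a, b, c):
    """Emulated FMA: exact a*b + c in rational arithmetic, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# The exact product is 1 - 2**-54, which is not representable as a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print("unfused:", unfused_mul_add(a, b, c))
print("fused:  ", fused_mul_add(a, b, c))
```

The unfused version rounds the product to 1.0 and loses the `2**-54` term entirely, while the single-rounding version keeps it. In an iterated computation like mandelbrot, differences of this size can plausibly be amplified until the printed output differs.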

lukego commented 7 years ago

@MikePall Aha! Thanks for pointing out bench/PARAM*. Just the thing.

For me it is important to run tests 100+ times and to seed them with entropy. While we have issues like luajit/luajit#218 to contend with I think that benchmark results need to be interpreted as probability distributions rather than scalar values.

(The non-determinism is perhaps more important to me than to others. In the Snabb context we absolutely cannot have a situation where you deploy 100 routers and expect 5 of them to have half the capacity of the others. People are currently using lousy workarounds like detecting system overload and calling jit.flush() to roll the dice on a new trace. I need to find a proper solution to this & the CI has to show me improvements and regressions in how dependable performance is in the presence of workload entropy.)
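A sketch of what treating results as a distribution could mean in practice: instead of reporting one number, summarize the spread and count how many runs fall far behind the median. The "at least 2x the median" threshold, the function name, and the data are assumptions made up to mirror the 5-in-100 scenario above.

```python
import statistics

def summarize(samples, slow_factor=2.0):
    """Summarize run times and the fraction of runs >= slow_factor * median."""
    ordered = sorted(samples)
    median = statistics.median(ordered)
    slow = [t for t in ordered if t >= slow_factor * median]
    return {
        "min": ordered[0],
        "median": median,
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
        "slow_fraction": len(slow) / len(ordered),
    }

# Hypothetical data: 95 runs take about 1.0 s, 5 runs take about 2.1 s.
samples = [1.0] * 95 + [2.1] * 5

for key, value in summarize(samples).items():
    print(f"{key}: {value:.3f}")
```

With data like this the mean barely moves away from the typical run time, which is why a scalar summary can hide exactly the tail that matters for a deployment.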

lukego commented 7 years ago

I have updated the CI to run from PARAM_x86_CI.txt from my branchmarks branch. This is closely based on PARAM_x86_CI.txt but I removed a couple that seemed to fail or hang.

The results permalink is the same. Hopefully the report is beginning to be meaningful. Now each benchmark takes between 0.1s and 10s which is hopefully a reasonable range for getting stable and meaningful results.

I have pulled the iteration count down to 12 from 100. The Relative Standard Deviation graph probably needs to be taken with a grain of salt. I will revisit this when time permits. (Just now I am running all the iterations in a bash loop which ties up a test server continuously. I should make each run into a separate Nix derivation so that the CI will schedule them intelligently e.g. parallelize across more servers and interleave with other CI tasks instead of blocking them.)

Notable difference by eyeball is that the report is no longer flagging corsix/newgc as slower on the binary-trees benchmark. Previously this benchmark was only running for around 0.001 seconds and so the difference may well have been due to some tiny constant factor.

SameeraDes commented 5 years ago

I am trying to run the benchmarks in a continuous integration job for the Aarch64 port, which is in v2.1. Is there a central CI system to which the Aarch64 tests can be added, or do I need to set up a completely new CI job for this?

nico-abram commented 5 years ago

@lukego https://hydra.snabb.co/build/3807227 errors with "Aborted: cannot connect to ‘root@murren-1.snabb.co’: ssh: connect to host murren-1.snabb.co port 22: Connection timed out (propagated from build 3807225)". This one (https://hydra.snabb.co/build/3803719) seems to be the most recent passing build.

lukego commented 5 years ago

@nico-abram ah yes! The compute hosts running these LuaJIT benchmarks have recently been retired. I didn't think of this job because I haven't seen much activity here over the past few years and don't know how much interest there is.

If you want to run the benchmarks locally and generate the report you can use the instructions in the RaptorJIT README that I hope will work with standard LuaJIT too. I'm happy to advise if someone wants to troubleshoot a local setup or run a new CI.

If someone wants to sponsor running and updating a benchmark CI for LuaJIT then I'm also happy to help with that in my professional capacity at Snabb Solutions.

P.S. Here are some of the other ways that I put these tests to use while exploring the contribution of individual optimizations to overall performance:

Validating the HOTCOUNT table https://github.com/raptorjit/raptorjit/issues/56.
Validating LuaJIT optimizations https://github.com/raptorjit/raptorjit/issues/46
Validating LuaJIT micro-optimizations https://github.com/raptorjit/raptorjit/issues/48

That last one turned up a potentially important micro-optimization:

md5 benchmark 15% speedup by removing "slow LEA" https://github.com/raptorjit/raptorjit/issues/48

Surprisingly interesting to take simple benchmarks and use them to make systematic experiments!

lukego commented 5 years ago

@SameeraDes Good question. This CI is based on Nix and Nix seems to support ARM these days. So it should be possible to add an ARM server onto the backend but I don't know how much hassle to expect. The sticky-tape solution could also be for random machines to post results to Git repos in plain text and for this CI to download those and build/publish the reports.

I am meaning to migrate over to https://www.hercules-ci.com/ but haven't made time for that yet.

SameeraDes commented 5 years ago

Thanks for your response, @lukego. I have added a Jenkins-based CI for ARM64 for now. It would be great if we could have a central CI for all LuaJIT perf runs; I am willing to contribute for the ARM64 port.

siddhesh commented 5 years ago

@lukego we have set up a CI loop for luajit on the Linaro CI to run tests on commits to v2.1 on arm64:

https://ci.linaro.org/job/luajit-aarch64-perf/

We'll be happy to add an x86_64 node to it if you have one, or add an x86_64 node ourselves.

As for other architectures, please feel free to ping me either on this issue or personally to have more nodes added to the trigger. At some point we also need to figure out a place to report the results.

lukego commented 5 years ago

@siddhesh Cool!

I am running a CI for RaptorJIT and related projects that sometimes covers LuaJIT too. I don't have spare machines to contribute to other CIs like yours though so please go ahead with your own.

LuaJIT / LuaJIT-test-cleanup

CI for benchmarks online #10