ReactiveBayes / RxInfer.jl

Julia package for automated Bayesian inference on a factor graph with reactive message passing
MIT License

Continuous benchmarking #80

Open bartvanerp opened 1 year ago

bartvanerp commented 1 year ago

In the future it would be good to have some kind of benchmarking system in our CI, so that we become aware of how changes in our code impact performance. An example of such a system is provided by FluxBench.jl and its corresponding website.

bvdmitri commented 10 months ago

Before we commence the actual benchmarking process, it's crucial to conduct preliminary research to determine what tasks we can and should perform ourselves and what components we can potentially leverage from other packages or libraries. Here are the key aspects we need to investigate:

This task has been added to the milestone for tracking and prioritization.

bvdmitri commented 8 months ago

@bartvanerp This task has been added to the milestone for tracking and prioritization.

bartvanerp commented 8 months ago

Just did some very extensive research:

Benchmarking Methodology: I think PkgBenchmark.jl is the best option for creating the benchmark suite. I played around with it for RxSDE.jl a bit and really liked it. This package, however, only measures execution speed, but I think this metric is a good one to start with. Other metrics, once relevant and implemented, would likely require custom tests anyway.
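
For reference, a minimal sketch of what such a PkgBenchmark.jl-compatible suite could look like (the group names and the benchmarked operation below are placeholders, not a proposal for the actual suite):

```julia
# benchmark/benchmarks.jl -- PkgBenchmark.jl runs this file and expects it to
# define a top-level `SUITE::BenchmarkGroup`.
using BenchmarkTools
using LinearAlgebra

const SUITE = BenchmarkGroup()
SUITE["linalg"] = BenchmarkGroup()

# Benchmark a representative operation for a few problem sizes (placeholder example).
for n in (10, 100, 1000)
    A = randn(n, n)
    A = A' * A + n * I  # make the matrix positive definite
    SUITE["linalg"]["cholesky_n=$(n)"] = @benchmarkable cholesky($A)
end
```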

Benchmark Targets: Let's start off with automatic execution speed as a performance metric. Later on we can extend it, if we have some relevant other metrics and appropriate tests. For me this is beyond the scope of this PR.

Reporting and Visualization: PkgBenchmark.jl automatically generates a report (with differences) between two commits. There also exists BenchmarkCI.jl to run this on GitHub, but I don't think it will give us reliable performance metrics. FluxBench.jl depends on both, but is likely very much tailored towards Flux.jl, so I am not sure whether this is desirable. For now I propose to simply generate the report and include it manually in the PR, which will be required before the PR gets approved. Other reporting/visualization tools would be nice, but we will probably have to implement those ourselves.
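
To make the proposed manual workflow concrete, a hedged sketch (package name and git refs are placeholders) of generating such a difference report locally:

```julia
# Compare a feature branch against main with PkgBenchmark.jl and export the
# difference report as markdown, which can then be pasted into the PR.
using PkgBenchmark

judgement = PkgBenchmark.judge("RxInfer", "my-feature-branch", "main")
export_markdown("benchmark_report.md", judgement)
```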

Results Storage: If we manually copy them in the PR, then they are saved there. Ideally we want something similar as CodeCov, which just executes the PR and shows the difference report.

Benchmark Execution: I think this post is a nice example of running the benchmarks through CI on GitHub-hosted runners: https://labs.quansight.org/blog/2021/08/github-actions-benchmarks. Timings on shared runners are just not very stable, though. Furthermore, it will burn through our GitHub minutes. We could hook up a Raspberry Pi (which is not fast, but perhaps that is actually a good thing, as we are targeting these devices) as a custom runner: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners.

@bvdmitri @albertpod let me know what you think.

bartvanerp commented 8 months ago

Aside from the real performance benchmark, we can also already start building a test suite for allocations using AllocCheck.jl: https://github.com/JuliaLang/AllocCheck.jl.
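
As a hedged sketch of what such an allocation check could look like (`scale!` is a made-up example, not a function from our packages):

```julia
# AllocCheck.jl statically verifies that the annotated method cannot allocate;
# calling it throws if the compiler cannot prove the call allocation-free.
using AllocCheck

@check_allocs scale!(y, a) = (y .*= a; nothing)

y = randn(100)
scale!(y, 2.0)  # errors at call time if the compiled method could allocate
```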

bartvanerp commented 8 months ago

Today we discussed the issue together with @albertpod and @bvdmitri. We agree on the following plan:

All of our packages will need to be extended with a benchmark suite containing performance and memory (allocation) benchmarks. Alternative metrics can be added later once we have developed suitable methods for testing them. @bartvanerp will make a start with this for the FastCholesky.jl package to experiment with it.

Starting in January we will extend the benchmark suites to our other packages and will divide tasks.

For now we will ask everyone to run the benchmarks locally when filing a PR. The benchmarking diff/results will need to be uploaded with the PR. Future work will be to automate this using a custom GitHub runner (our Raspberry Pi), and to visualize results online.
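
A minimal sketch of that local step (package name and output file are placeholders):

```julia
# Run the package's benchmark/benchmarks.jl suite on the current working copy
# and save a markdown report to attach to the pull request.
using PkgBenchmark

results = benchmarkpkg("FastCholesky")
export_markdown("benchmark_results.md", results)
```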

bartvanerp commented 8 months ago

Made a start with the benchmarks for FastCholesky.jl at https://github.com/biaslab/FastCholesky.jl/pull/8.

There is one point which I need to adjust in my above statements: let's skip the extra allocation benchmarks, as these are automatically included in PkgBenchmark.jl.

bartvanerp commented 8 months ago

Coming back to the memory benchmarking: I think it will still be good to create tests for in-place functions, which we assume to be non-allocating, to check whether they are indeed non-allocating. Something like a test which checks `@allocated foo() == 0`. The AllocCheck package currently does not support this, but the TestNoAllocations package does. Nonetheless, AllocCheck has some PRs which will include this behaviour and supersede TestNoAllocations: https://github.com/JuliaLang/AllocCheck.jl/issues/59, https://github.com/JuliaLang/AllocCheck.jl/pull/55
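
A minimal sketch of such a test using only Base's `@allocated` (here `foo!` is a hypothetical in-place function):

```julia
using Test

foo!(y, x) = (y .= 2 .* x; nothing)  # placeholder for an in-place, non-allocating function

@testset "foo! does not allocate" begin
    x, y = randn(100), zeros(100)
    foo!(y, x)                           # warm-up call, excludes compilation allocations
    @test (@allocated foo!(y, x)) == 0
end
```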

bvdmitri commented 8 months ago

I used AllocCheck here. The limitation is that it checks allocations statically, which limits its application to very small, type-stable functions. Still useful, though.

wouterwln commented 4 months ago

Let's make sure we have a benchmark suite boilerplate set up before the 3.0.0 release, so that we can track performance from 3.0.0 onwards.

wouterwln commented 3 months ago

I'm moving this to 3.1.0 now, but I suggest we use https://github.com/MilesCranmer/AirspeedVelocity.jl for this. It works with the existing ecosystem of BenchmarkTools and PkgBenchmark. Let's investigate whether we can benchmark RMP and GraphPPL behaviour with this as well.