containerd / containerd

An open and reliable container runtime
https://containerd.io
Apache License 2.0

Benchmarking Proposal #7378

Open sbuckfelder opened 2 years ago

sbuckfelder commented 2 years ago

Problem Statement

There is currently no consistent way to tell whether commits are causing unexpected changes in containerd’s performance. Adding a benchmarking framework and automation mechanisms would allow the project to understand the performance implications of new code commits.

High Level Solution

Benchmarking Framework

We want the framework to produce metrics concerning the latency of different high-level actions, such as start, stop, and so on. Ideally, the framework will be generic enough that it can be packaged as a library and reused by subprojects (stargz-snapshotter, for example).
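To make the library-and-reuse idea concrete, a minimal sketch of what such a surface could look like is below; the `benchmark` package, `StepFunc`, and `Run` names are hypothetical and do not exist in containerd today.

```go
// Package benchmark is a hypothetical sketch of a reusable benchmarking
// library surface, not existing containerd code.
package benchmark

import (
	"context"
	"time"
)

// StepFunc performs one high-level lifecycle action (e.g. start or stop) once.
type StepFunc func(ctx context.Context) error

// Result holds the raw latency samples collected for a single named step.
type Result struct {
	Step    string
	Samples []time.Duration
}

// Run executes step n times and records the latency of each execution.
// A subproject such as stargz-snapshotter could reuse this by supplying
// its own StepFunc implementations.
func Run(ctx context.Context, name string, n int, step StepFunc) (Result, error) {
	res := Result{Step: name, Samples: make([]time.Duration, 0, n)}
	for i := 0; i < n; i++ {
		begin := time.Now()
		if err := step(ctx); err != nil {
			return res, err
		}
		res.Samples = append(res.Samples, time.Since(begin))
	}
	return res, nil
}
```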

Proposed Statistics

We want not only high-level statistics such as the average, but also distribution information such as percentiles and standard deviation. This will give us confidence in the robustness of the statistics and help us identify worst-case scenarios. Proposed statistics (a sketch of computing them from raw samples follows the list):

Mean
Standard Deviation
Minimum
25th Percentile
50th Percentile (median)
75th Percentile
90th Percentile
Maximum
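Purely as an illustration, these could be computed from the raw latency samples along the following lines; the `Summary` type and `summarize` function are hypothetical, and the nearest-rank percentile definition is an arbitrary choice.

```go
package benchmark

import (
	"math"
	"sort"
	"time"
)

// Summary holds the proposed statistics for one benchmarked step.
type Summary struct {
	Mean, StdDev, Min, P25, P50, P75, P90, Max time.Duration
}

// summarize computes the proposed statistics from raw latency samples.
func summarize(samples []time.Duration) Summary {
	if len(samples) == 0 {
		return Summary{}
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var sum float64
	for _, s := range sorted {
		sum += float64(s)
	}
	mean := sum / float64(len(sorted))

	var sqDiff float64
	for _, s := range sorted {
		d := float64(s) - mean
		sqDiff += d * d
	}
	stdDev := math.Sqrt(sqDiff / float64(len(sorted)))

	// Nearest-rank percentile over the sorted samples.
	percentile := func(p float64) time.Duration {
		idx := int(math.Ceil(p*float64(len(sorted)))) - 1
		if idx < 0 {
			idx = 0
		}
		return sorted[idx]
	}

	return Summary{
		Mean:   time.Duration(mean),
		StdDev: time.Duration(stdDev),
		Min:    sorted[0],
		P25:    percentile(0.25),
		P50:    percentile(0.50),
		P75:    percentile(0.75),
		P90:    percentile(0.90),
		Max:    sorted[len(sorted)-1],
	}
}
```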

Automation Mechanisms

Ideally we would like this to run on code changes and then identify regressions via comparison to previous runs. This will allow us to see performance regressions between commits. To automate this we can use GitHub Actions as our starting point. A few considerations to keep in mind:

What Shall We Benchmark?

Ideally the benchmarks should be comprehensive across four dimensions: lifecycle steps, platforms, snapshotters, and benchmark containers. A small sketch of how this matrix could be expressed follows the dimension headings below.

Lifecycle Steps

Platforms / Architectures

Snapshotters

Benchmark Containers
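As a purely illustrative sketch, the four dimensions could be captured in a small matrix/config struct; the type and field names are hypothetical, and the example values are limited to those named in the proof of concept below.

```go
// Matrix describes the hypothetical benchmark matrix across the four
// proposed dimensions; values here mirror the proof-of-concept subset.
type Matrix struct {
	LifecycleSteps []string // e.g. "start", "stop"
	Platforms      []string // e.g. "linux/amd64"
	Snapshotters   []string // e.g. "overlayfs", "devmapper"
	Containers     []string // image refs, e.g. "docker.io/library/busybox:latest"
}
```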

Proof of Concept Proposal

For a proof of concept, build a simple version of the benchmark tool that only operates on a subset of the above dimensions:

Lifecycle Steps: start
Platform: Linux
Snapshotters: overlayfs, devmapper
Benchmark Containers: busybox
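A rough sketch of how the start measurement could be driven through the containerd Go client (1.x module path), assuming the default overlayfs snapshotter; the socket path, namespace, container ID, and sleep command are illustrative, error handling is minimal, and a real tool would loop this to collect many samples and clean up between iterations.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "benchmark")

	image, err := client.Pull(ctx, "docker.io/library/busybox:latest", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}

	// A devmapper run could instead select that snapshotter via the container
	// options (and pull/unpack the image for it) rather than using the default.
	container, err := client.NewContainer(ctx, "bench-start-0",
		containerd.WithImage(image),
		containerd.WithNewSnapshot("bench-start-0-snapshot", image),
		containerd.WithNewSpec(oci.WithImageConfig(image), oci.WithProcessArgs("sleep", "60")),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	// Time the "start" lifecycle step: task creation plus task start.
	begin := time.Now()
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		log.Fatal(err)
	}
	if err := task.Start(ctx); err != nil {
		log.Fatal(err)
	}
	elapsed := time.Since(begin)

	if _, err := task.Delete(ctx, containerd.WithProcessKill); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("start latency: %v\n", elapsed)
}
```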

The resulting metrics will then be compared to the previous run and regressions will be called out via the GitHub Actions interface.
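One possible shape for that comparison step, assuming each run writes its per-step summaries to a JSON file; the file layout, the 10% threshold, and checking only p90 are assumptions rather than a settled design. A non-zero exit plus an `::error::` workflow command is enough for the GitHub Actions job to fail and surface the regression on the pull request.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// StepSummary mirrors the statistics a benchmark run might emit, in nanoseconds.
type StepSummary struct {
	Step string  `json:"step"`
	P50  float64 `json:"p50"`
	P90  float64 `json:"p90"`
}

func main() {
	prev := load("previous.json")
	curr := load("current.json")

	const threshold = 1.10 // flag anything more than 10% slower than the previous run
	failed := false
	for step, c := range curr {
		p, ok := prev[step]
		if !ok {
			continue // new step, nothing to compare against yet
		}
		if c.P90 > p.P90*threshold {
			fmt.Printf("::error::%s p90 regressed: %.2fms -> %.2fms\n", step, p.P90/1e6, c.P90/1e6)
			failed = true
		}
	}
	if failed {
		os.Exit(1) // non-zero exit fails the GitHub Actions job
	}
}

// load reads a JSON array of StepSummary records and indexes it by step name.
func load(path string) map[string]StepSummary {
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var steps []StepSummary
	if err := json.Unmarshal(data, &steps); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	out := make(map[string]StepSummary, len(steps))
	for _, s := range steps {
		out[s.Step] = s
	}
	return out
}
```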

This will help us answer questions concerning the overall framework (how to create easily extensible abstractions) and the automation mechanism (how best to use GitHub Actions).

Open Questions

kzys commented 2 years ago

Regarding hosting, we should try https://github.blog/changelog/2022-09-01-github-actions-larger-runners-are-now-in-public-beta/ as well.

estesp commented 2 years ago

Regarding hosting, we should try https://github.blog/changelog/2022-09-01-github-actions-larger-runners-are-now-in-public-beta/ as well.

Good point; I just read that post this morning as well. I assume our CNCF "enterprise"/OSS access will cover those runners, but we may have to dig into that. Definitely seems promising.

dcantah commented 2 years ago

Do we want to run on every PR? Which branches? Should we add a tag similar to ok-to-test?

In my head, the last portion makes sense. A simple example: we can make a reasonable guess beforehand as to whether a given change will actually affect performance at all, and skip using the compute if it would be irrelevant (e.g., someone just fixed a spelling mistake or made doc changes).

I think benchmarking would make sense to run on the currently supported release branches and main.