CMCDragonkai opened 2 years ago
Going to dump some notes on investigating this issue.
In JS, the de facto standard for benchmarking is the benchmark.js library (https://github.com/bestiejs/benchmark.js).
This has been extended and made usable from TypeScript by benny (https://github.com/caderek/benny), which is what we have been using.
However, we have been using it without fully understanding the complexity built into benchmarking.
Firstly, benny gives us `b.suite`, which constructs a suite of comparable benchmarks. You can create independent suites to represent different "incomparable" dimensions of a system, but everything that should be compared together should be part of one suite.

A suite is also executable. However, we wrap it in our `main` function to better organise our process execution, and export the suites to allow us to run all the benchmarks in one go. In any case, it is possible to do `npm run ts-node -- ./benches/some-suite.ts`.
Suites can contain a set of benchmarks that are run in-order just like jest tests.
Just like jest, each benchmark can have an `afterEach`-like hook: the `cycle` function. By default it just prints progress results to the TTY. However, you can pass a callback to receive a single case result and the entire summary as first and second parameters. Don't use this to do teardown; there is no "teardown" available through benny (even though benchmark.js supports this concept). Only use it to report per-benchmark results, or do something with the per-benchmark results.
Then there's `complete`, which is similar to `afterAll` in jest. It runs at the very end of all benchmarks and reports on the summary. Use it the same way as `cycle`: to do some reporting on the summary. It is possible to use `complete` as a custom saver function instead of the `b.save` functions, which only support JSON, CSV, etc., for your own custom reporting format. This is what we will use to report in the OpenMetrics/Prometheus text format, which is usable by GitLab metrics reporting.
Now as for the benchmarking itself, it is actually quite complex. These resources are relevant:
So the basic idea is that:
- In benny, there is no setup nor teardown configured, so there's only the test function iterations.
- `minTime`: floating point seconds; by default 0.05, which is 50 ms.
- `maxTime`: floating point seconds; by default 5, which is 5000 ms.
- `minSamples`: a number of samples to RUN before doing additional automatic sampling. By default that's 5, so it will always run at least 5 samples no matter what.
- `initCount`: which may be used during the analysis phase; I'm not sure.

The end result is that how fast an operation is, is measured in ops per second with a margin of error. This number is derived by averaging all the samples it has acquired. A sample is not 1 run of the test function, but can be N runs of the test function, where N is the number of times the test function can be executed within `minTime` seconds.
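A simplified model of this sampling arithmetic can be sketched as follows. All names here are invented for illustration; benchmark.js's real internals are more involved (it uses a t-distribution critical value rather than the fixed 1.96 approximated here):

```typescript
// One sample = N runs of the test function, where `iterations` was
// calibrated so that `elapsed` (in seconds) is at least minTime (~0.05s).
type Sample = { iterations: number; elapsed: number };

function opsPerSecond(samples: Sample[]): { ops: number; margin: number } {
  // per-sample rate in ops/sec
  const rates = samples.map((s) => s.iterations / s.elapsed);
  const mean = rates.reduce((acc, r) => acc + r, 0) / rates.length;
  // sample variance of the rates
  const variance =
    rates.reduce((acc, r) => acc + Math.pow(r - mean, 2), 0) /
    (rates.length - 1);
  const sem = Math.sqrt(variance / rates.length); // standard error of the mean
  // 1.96 approximates the 95% confidence critical value at ~100 samples
  const moe = 1.96 * sem;
  return { ops: mean, margin: (moe / mean) * 100 }; // margin as a percentage
}
```

This is why a run reports something like `3755997 ops/sec ±0.34%`: the ops number is the mean of the per-sample rates, and the margin is the relative margin of error across samples.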
When running benny, you should see that it takes about 5 seconds per benchmark, because that's the `maxTime` configuration. And you should see by default about 100 samples being acquired, perhaps a little fewer, since the time measurement isn't perfect. I think it may also keep running iterations beyond `minTime` if the margin of error/uncertainty hasn't come down to 1% or less. So there's a lot of "smarts" built into it. This means I reckon most of the time we will have 100 or fewer samples.
It is important to note that benny's setup closure is executed a couple of times. Make sure it is idempotent. It doesn't do this for analysis; it just does this to get the type of the output, and to figure out how to run it. If you need to do setup that cannot be idempotent, do it before creating the `b.suite`.
The Prometheus text format is pretty easy to work with. We'll use a structure like:

```
# HELP suitename this is a logger
# TYPE suitename gauge
suitename{name="benchmarkname"} 3755997
```
Where the number is just the ops per second.
In the future we can upgrade to histogram or summary, for which we have the information as well; however, GitLab at the moment doesn't have any reporting UX that understands histograms or summaries.
We could also report some other information too, like `samples` and `margin`. Maybe something like:

```
# TYPE suitename_ops gauge
suitename_ops{name="benchmarkname"} 3755997
suitename_ops{name="benchmarkname2"} 37559
# TYPE suitename_margin gauge
suitename_margin{name="benchmarkname"} 0.34
suitename_margin{name="benchmarkname2"} 0.15
# TYPE suitename_samples counter
suitename_samples{name="benchmarkname"} 96
suitename_samples{name="benchmarkname2"} 97
```
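A report in this shape could be rendered by a small helper along these lines. This is a hypothetical sketch, not our actual saver; `CaseResult` just mirrors the fields we'd pull out of benny's summary:

```typescript
// Fields extracted from a benny case result (illustrative shape)
type CaseResult = { name: string; ops: number; margin: number; samples: number };

// Render results as Prometheus text-format gauges/counters, one metric
// family per statistic, with the benchmark name as a label.
function toMetrics(suiteName: string, results: CaseResult[]): string {
  const lines: string[] = [];
  lines.push(`# TYPE ${suiteName}_ops gauge`);
  for (const r of results) {
    lines.push(`${suiteName}_ops{name="${r.name}"} ${r.ops}`);
  }
  lines.push(`# TYPE ${suiteName}_margin gauge`);
  for (const r of results) {
    lines.push(`${suiteName}_margin{name="${r.name}"} ${r.margin}`);
  }
  lines.push(`# TYPE ${suiteName}_samples counter`);
  for (const r of results) {
    lines.push(`${suiteName}_samples{name="${r.name}"} ${r.samples}`);
  }
  return lines.join('\n') + '\n';
}
```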
The timestamp for a metric is optional.
Still useful though.
Then our `benches/index.ts` just aggregates all these metrics reports together to form a single metrics report, and this is submitted with a job:

```yaml
metrics:
  script:
    - echo 'metric_name metric_value' > metrics.txt
  artifacts:
    reports:
      metrics: metrics.txt
```
This can be done for each build platform so we can get information from each platform.
The benchmarks take a while to run (roughly 5 seconds per test), so in a minute there'd be time to do about 12 benchmarks. So it should be suitable.
This is currently being trialled out here: https://github.com/MatrixAI/js-logger/pull/19.
As one can see, GitLab's metrics report only appears on GitLab MRs. So we would need to update the GitHub bot to report this information to the relevant PR. We could auto-report benchmark information to the staging PR.
Some additional information is relevant here.

When benchmarking, in order to get reproducible results you need the `performance` CPU frequency governor (not `ondemand` nor `schedutil`), in order to maintain a consistent CPU frequency. Changing to `performance` is easy:

```
cpupower frequency-set -g performance
```

Then also, for disabling boosting:

```
echo 0 > /sys/devices/system/cpu/cpufreq/boost
```
You can switch it back afterwards to what it was. For example (do this in a `sudo -i` shell, otherwise it reports incorrectly):
```
root@matrix-ml-1:~/ > cpupower frequency-info
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 2.20 GHz - 3.70 GHz
  available frequency steps: 3.70 GHz, 3.20 GHz, 2.20 GHz
  available cpufreq governors: ondemand performance schedutil
  current policy: frequency should be within 2.20 GHz and 3.70 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: 3.70 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: no
    Boost States: 0
    Total States: 3
    Pstate-P0: 3700MHz
    Pstate-P1: 3200MHz
    Pstate-P2: 2200MHz
```
Next, use these tools to find out the current state:

```
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

And:

```
watch -n 0.1 grep MHz /proc/cpuinfo
```
The end result is that you get a "consistent" CPU frequency, which should give you more consistent benchmarks.

I did notice inconsistent runs when benchmarking: sometimes 12 million ops/s on one run, then 8 million ops/s on another. Laptops in particular will do CPU throttling, slowing down the system when off power. See https://discourse.nixos.org/t/how-to-switch-cpu-governor-on-battery-power/8446/7
So benchmarks done on my desktop should have these settings applied. On the CI/CD we won't have control over these things anyway, so we won't know; the most reliable benchmarks should still be done on `matrix-ml-1` for consistency.
Relevant:
If boost is turned on, you'll get better results at times, and then when it hits the thermal limit you may get worse results later. It's just somewhat inconsistent.
This script, which I've written to support temporarily running something with the right CPU settings:

```bash
#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

if [[ "$@" == "" || "$@" == *-h* || "$@" == *--help* ]]; then
  cat<<EOF
fixed-cpu-run - Temporarily set CPU frequency governor to performance and disable CPU boost
Intended for reproducible benchmarks

Usage:
  fixed-cpu-run <command>
  fixed-cpu-run -h | --help

Options:
  -h --help  Show this help text.
EOF
  exit 64
fi

governor="$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"
boost="$(cat /sys/devices/system/cpu/cpufreq/boost)"

cleanup () {
  sudo sh << EOF
set -o errexit
set -o nounset
set -o pipefail
cpupower frequency-set -g "$governor" >/dev/null
echo "$boost" > /sys/devices/system/cpu/cpufreq/boost
EOF
}

trap cleanup EXIT

sudo sh << "EOF"
set -o errexit
set -o nounset
set -o pipefail
cpupower frequency-set -g performance >/dev/null
echo 0 > /sys/devices/system/cpu/cpufreq/boost
EOF

"$@"
```

With it we can do `fixed-cpu-run npm run bench`.
At the moment, we don't have any additional bot integration for test or benchmark reports. Both reports are just left on the system, and the GH bot does not report them to our PR. If we start using GitLab MRs, that would be a different situation, as they have native integration.
So for now, we will consider this to have been at least partially implemented for `js-logger`. Its benchmarking style can be propagated to the other projects when necessary.
For future work, we will need a GH bot that acquires the artifacts, and pushes them to be rendered as a comment on the PR, possibly rendering a markdown table and with graphics.
It's also important to be able to compare against previous runs on the CI/CD, so we could use https://docs.gitlab.com/ee/ci/yaml/#needsproject to refer to the same job on the previous pipeline of the same project, to get the previous metrics report. Then a diff of the 2 files can be done to show what has changed between the metrics. See https://github.github.com/gfm/ for more information on what can be displayed: Markdown tables, HTML tables.
Benny can also generate `'json' | 'csv' | 'table.html' | 'chart.html'`. Most likely CSV conversion to markdown will be the easiest to do.
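Such a conversion could be as simple as the following sketch (a hypothetical helper, assuming a plain CSV with a header row and no quoted commas, which is the easy case for benny's CSV output):

```typescript
// Convert a simple CSV string (header row + data rows) into a GFM table.
// Assumes no commas inside quoted fields.
function csvToMarkdown(csv: string): string {
  const rows = csv.trim().split('\n').map((line) => line.split(','));
  const [header, ...body] = rows;
  const fmt = (cells: string[]) => `| ${cells.join(' | ')} |`;
  return [
    fmt(header),
    fmt(header.map(() => '---')), // GFM delimiter row
    ...body.map(fmt),
  ].join('\n');
}
```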
I would also suggest outputting the benchmark results to `./tmp/X`, where `X` is some relevant job name. This means the metric reports can be gathered into one job to do the GH bot reporting, avoiding conflicts.
The same idea could be used for test reporting. https://github.com/marketplace/actions/publish-unit-test-results
Will leave this to a later date to pursue and solve together with unit test reporting.
The complexity of doing this sort of reporting would be onerous on each project. It's best to abstract pipeline configuration for all the projects so that they can inherit common jobs like this: #48. Scripts can be provided within the `nix-shell`, but that would require Nix provisioning of script data, which won't work on Windows or macOS for now. Perhaps one might distribute tools directly in the docker gitlab-runner image, like for example https://github.com/davidahouse/junit-to-md
Also, benchmark results would then be considered an artifact, not something to be committed to the repo. Although it'd be nice to have a persistent place to publish these benchmark results that still works for interactive usage. At the very least, CI/CD benchmarks would just be temporary/per-pipeline artifacts; I'm more talking about placing benchmark results produced interactively somewhere...
I've had to update the script to deal with Intel CPUs, which are slightly different. Changes located here: https://github.com/CMCDragonkai/.dotfiles-nixos/commit/fe44e07350a287613e25fd2c561a0cc04679574b Recommended if doing any benchmarking.
On the Vostro laptops you probably want to change to the `powersave` governor to save battery @emmacasolin @tegefaulkes; it uses variable frequency, so it will increase frequency when more load occurs. It depends on your use case, in case battery usage is an issue. See the `dell-vostro-5402` repo.
We should be using vega cli to generate static graphics or SVGs that can then be presented by the bot.
I've just updated the structure of Polykey's benchmarking, which looks like:

```
[nix-shell:~/Projects/Polykey]$ tree ./benches/
./benches/
├── index.ts
├── results
│   ├── git
│   │   ├── gitgc.chart.html
│   │   ├── gitgc.json
│   │   └── gitgc_metrics.txt
│   ├── keys
│   │   ├── key_generation.chart.html
│   │   ├── key_generation.json
│   │   ├── key_generation_metrics.txt
│   │   ├── random_bytes.chart.html
│   │   ├── random_bytes.json
│   │   ├── random_bytes_metrics.txt
│   │   ├── symmetric_crypto.chart.html
│   │   ├── symmetric_crypto.json
│   │   └── symmetric_crypto_metrics.txt
│   ├── metrics.txt
│   └── system.json
├── suites
│   ├── git
│   │   └── gitgc.ts
│   └── keys
│       ├── key_generation.ts
│       ├── random_bytes.ts
│       └── symmetric_crypto.ts
└── utils.ts

6 directories, 20 files
```
Note that `utils.ts` now provides some useful functions like `fsWalk`, and the ability to parse the file names.
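A `fsWalk` helper could look something like the sketch below; this is my guess at the shape of such a utility, not the actual Polykey implementation:

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Recursively collect all file paths under `dir`, depth-first.
function fsWalk(dir: string): string[] {
  const files: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      files.push(...fsWalk(full)); // recurse into subdirectories
    } else {
      files.push(full);
    }
  }
  return files;
}
```

Walking `./benches/suites` this way gives us the relative paths that the results directory mirrors.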
Basically, we get automatic mapping of the suite relative paths to the results relative paths. But the OpenMetrics format also limits us to using a dot separator (e.g. `keys.blah`) instead of `/` as a separator. So the metric names look like `keys.symmetric_crypto`, etc.
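The path-to-metric-name mapping is mechanical; a sketch of it (the function name is invented, and it assumes `.ts` suite files):

```typescript
// Map a suite's relative path (e.g. "keys/symmetric_crypto.ts") to a
// dot-separated metric name (e.g. "keys.symmetric_crypto").
function suitePathToMetricName(relPath: string): string {
  return relPath
    .replace(/\.ts$/, '')   // drop the extension
    .split(/[\\/]/)         // split on posix or windows separators
    .join('.');             // OpenMetrics-friendly dot separator
}
```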
In the future, if we change our names, we have to be aware that this breaks any long-term viewing.
We can also make use of `git log` to find metric changes and load those files too. Will need to investigate later how to use the `git` tool to bring in a specific file from a previous commit.
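Retrieving a file from a previous commit is what `git show <rev>:<path>` does. Demonstrated below on a throwaway repo; in CI it would be something like `git show HEAD~1:benches/results/metrics.txt > previous_metrics.txt` (paths here are illustrative):

```shell
set -o errexit
# Build a throwaway repo with two versions of a metrics file
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.email bench@example.com
git config user.name bench
echo 'suitename_ops{name="benchmarkname"} 100' > metrics.txt
git add metrics.txt && git commit --quiet -m 'first'
echo 'suitename_ops{name="benchmarkname"} 120' > metrics.txt
git add metrics.txt && git commit --quiet -m 'second'
# Bring the previous commit's file into the working tree
git show HEAD~1:metrics.txt > previous_metrics.txt
# Diff the two reports (diff exits 1 on differences, hence || true)
diff previous_metrics.txt metrics.txt || true
```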
Specification
Now that gitlab supports metrics reports in its CI/CD (https://docs.gitlab.com/ee/ci/metrics_reports.html), we should be able to also integrate our benchmarking suite into this.
This should then replicate this github action: https://github.com/marketplace/actions/continuous-benchmark.
And then show us the benchmark with our MatrixAI-Bot on the PR.
The benchmarks would re-run whenever we submit a commit, but we could make use of "incremental computation" so that benchmarks only re-run if code that affects them has changed. This is more complicated: how does one track whether a script needs to run, unless one can track all of its dependencies?
Alternatively, we re-run them on specific "commit" tags. Like the way we use `[ci skip]`, we could do `[ci bench]` to trigger the bench. Similar issues occur with submitting commits and re-running only the tests that are affected by those changes: https://suncommander.medium.com/run-jest-for-unit-tests-of-modified-files-only-e39b7b176b1b
Benchmarking is useful for finding performance regressions. Our benchmarking suite would need to generate a file compatible with "open metrics".
Furthermore, profile reports are useful to find out about GC, memory pressure and CPU usage. The only issue is that it depends on what exactly we are profiling. For a sequential script that's possible, but for an application with multiple endpoints, each endpoint execution may have its own relevant profile. Therefore profiles are ultimately something that could be done along with benchmarks, since for each benchmark we can also generate a profiling report to analyse. We can get Node.js to generate a profiling report from V8, and then use https://github.com/jlfwong/speedscope to visualise it. The CI/CD system can push "pages" to GitHub Pages that update this information, and one could visit the repo's speedscope profile reports for any given benchmark.
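For example, assuming a recent Node.js (>= 12), the built-in `--cpu-prof` flag produces a V8 `.cpuprofile` file that speedscope can open directly. The directory and the inline workload below are just placeholders:

```shell
# Generate a V8 CPU profile; --cpu-prof-dir controls where it lands.
mkdir -p ./tmp/profiles
node --cpu-prof --cpu-prof-dir=./tmp/profiles -e '
  // stand-in for a benchmark workload
  let acc = 0;
  for (let i = 0; i < 1e6; i++) acc += Math.sqrt(i);
'
ls ./tmp/profiles
# The resulting CPU.*.cpuprofile can be loaded at https://www.speedscope.app
# or with: npx speedscope ./tmp/profiles/CPU.*.cpuprofile
```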
This should be done for any benchmarking reports too. We should make use of GitHub Pages more, since right now GitHub will auto-generate the pages deployment upon changes to the `docs` directory under the `master` branch. All that would need to be done is to update assets in the `docs` directory, or to update "artifacts" that URLs in the `docs` directory point to.

Additional context
Tasks