CMCDragonkai opened 2 years ago
Going to dump some notes on investigating this issue.
In JS, the de facto standard for benchmarking is the benchmark.js library (https://github.com/bestiejs/benchmark.js).
This has been extended and made usable from TypeScript by benny (https://github.com/caderek/benny), which is what we have been using.
However, we have been using it without fully understanding the complexity built into benchmarking.
Firstly, benny gives us `b.suite`, which constructs a suite of comparable benchmarks. You can create independent suites to represent different "incomparable" dimensions of a system, but everything that should be compared together should be part of one suite.

A suite is also executable. However, we wrap it in our `main` function to better organise our process execution, and export the suites to allow us to run all the benchmarks in one go. In any case, it is possible to do `npm run ts-node -- ./benches/some-suite.ts`.
Suites can contain a set of benchmarks that are run in-order just like jest tests.
Just like jest, each benchmark can have an `afterEach`-like hook: the `cycle` function. By default it just prints progress results to the TTY. However, you can pass a callback to receive a single case result and the entire summary as first and second parameters. Don't use this to do teardown; there is no "teardown" available through benny (even though benchmark.js supports this concept). Only use it to report per-benchmark results, or do something with the per-benchmark results.
Then there's `complete`, which is similar to `afterAll` in jest. It runs at the very end of all benchmarks and reports on the summary. Use it the same way as `cycle`: to do some reporting on the summary. It is possible to use `complete` as a custom saver function instead of the `b.save` functions, which only support JSON, CSV, etc., for your own custom reporting format. This is what we will use to report in the OpenMetrics/Prometheus text format, which is usable by GitLab metrics reporting.
Now as for the benchmarking itself, it is actually quite complex. These resources are relevant:
So the basic idea is that:
- In benny, there is no setup nor teardown configured, so there's only the test function iterations.
- `minTime`: floating point seconds; by default 0.05, which is 50 ms.
- `maxTime`: floating point seconds; by default 5, which is 5000 ms.
- `minSamples`: a number of samples to RUN before doing additional automatic sampling. By default that's 5, so it will always run at least 5 samples no matter what.
- `initCount`: which may be used during the analysis phase; I'm not sure.

The end result is that how fast an operation is, is measured in ops per second with a margin of error. This number is derived by averaging all the samples it has acquired. A sample is not 1 run of the test function, but can be N runs of the test function, where N is the number of times the test function can be executed within `minTime` seconds.
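A simplified model of this sampling arithmetic can be sketched as follows. All names here are invented for illustration; benchmark.js's real internals are more involved (it uses a t-distribution critical value rather than the fixed 1.96 approximated here):

```typescript
// One sample = N runs of the test function, where `iterations` was
// calibrated so that `elapsed` (in seconds) is at least minTime (~0.05s).
type Sample = { iterations: number; elapsed: number };

function opsPerSecond(samples: Sample[]): { ops: number; margin: number } {
  // per-sample rate in ops/sec
  const rates = samples.map((s) => s.iterations / s.elapsed);
  const mean = rates.reduce((acc, r) => acc + r, 0) / rates.length;
  // sample variance of the rates
  const variance =
    rates.reduce((acc, r) => acc + Math.pow(r - mean, 2), 0) /
    (rates.length - 1);
  const sem = Math.sqrt(variance / rates.length); // standard error of the mean
  // 1.96 approximates the 95% confidence critical value at ~100 samples
  const moe = 1.96 * sem;
  return { ops: mean, margin: (moe / mean) * 100 }; // margin as a percentage
}
```

This is why a run reports something like `3755997 ops/sec ±0.34%`: the ops number is the mean of the per-sample rates, and the margin is the relative margin of error across samples.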
When running benny, you should see that it takes about 5 seconds per benchmark, because that's the `maxTime` configuration. And you should see by default about 100 samples being acquired, perhaps a little fewer, since the time measurement isn't perfect. I think it may also keep running iterations beyond `minTime` if the margin of error/uncertainty hasn't come down to 1% or less. So there's a lot of "smarts" built into it. This means I reckon most of the time we will have 100 or fewer samples.
It is important to note that benny's setup closure is executed a couple of times. Make sure it is idempotent. It doesn't do this for analysis; it just does this to get the type of the output, and to figure out how to run it. If you need to do setup that cannot be idempotent, do it before creating the `b.suite`.
The Prometheus text format is pretty easy to work with. We'll use a structure like:

```
# HELP suitename this is a logger
# TYPE suitename gauge
suitename{name="benchmarkname"} 3755997
```
Where the number is just the ops per second.
In the future we can upgrade to histogram or summary, for which we have the information as well; however, GitLab at the moment doesn't have any reporting UX that understands histograms or summaries.
We could also report some other information too, like `samples` and `margin`. Maybe something like:

```
# TYPE suitename_ops gauge
suitename_ops{name="benchmarkname"} 3755997
suitename_ops{name="benchmarkname2"} 37559
# TYPE suitename_margin gauge
suitename_margin{name="benchmarkname"} 0.34
suitename_margin{name="benchmarkname2"} 0.15
# TYPE suitename_samples counter
suitename_samples{name="benchmarkname"} 96
suitename_samples{name="benchmarkname2"} 97
```
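A report in this shape could be rendered by a small helper along these lines. This is a hypothetical sketch, not our actual saver; `CaseResult` just mirrors the fields we'd pull out of benny's summary:

```typescript
// Fields extracted from a benny case result (illustrative shape)
type CaseResult = { name: string; ops: number; margin: number; samples: number };

// Render results as Prometheus text-format gauges/counters, one metric
// family per statistic, with the benchmark name as a label.
function toMetrics(suiteName: string, results: CaseResult[]): string {
  const lines: string[] = [];
  lines.push(`# TYPE ${suiteName}_ops gauge`);
  for (const r of results) {
    lines.push(`${suiteName}_ops{name="${r.name}"} ${r.ops}`);
  }
  lines.push(`# TYPE ${suiteName}_margin gauge`);
  for (const r of results) {
    lines.push(`${suiteName}_margin{name="${r.name}"} ${r.margin}`);
  }
  lines.push(`# TYPE ${suiteName}_samples counter`);
  for (const r of results) {
    lines.push(`${suiteName}_samples{name="${r.name}"} ${r.samples}`);
  }
  return lines.join('\n') + '\n';
}
```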
The timestamp for a metric is optional.
Still useful though.
Then our `benches/index.ts` just aggregates all these metrics reports together to form a single metrics report, and this is submitted with a job:

```yaml
metrics:
  script:
    - echo 'metric_name metric_value' > metrics.txt
  artifacts:
    reports:
      metrics: metrics.txt
```
This can be done for each build platform so we can get information from each platform.
The benchmarks take a while to run (roughly 5 seconds per test), so in a minute there'd be time to do about 12 benchmarks. So it should be suitable.
This is currently being trialled out here: https://github.com/MatrixAI/js-logger/pull/19.
As one can see, GitLab's metrics report only appears on GitLab MRs. So we would need to update the GitHub bot to report this information to the relevant PR. We could auto-report benchmark information to the staging PR.
Some additional information is relevant here.

When benchmarking, in order to get reproducible results you need the `performance` CPU frequency governor (not `ondemand` nor `schedutil`), in order to maintain a consistent CPU frequency. Changing to `performance` is easy:

```
cpupower frequency-set -g performance
```

Then also, for disabling boosting:

```
echo 0 > /sys/devices/system/cpu/cpufreq/boost
```
You can switch it back afterwards to what it was. For example (do this in a `sudo -i` shell, otherwise it reports incorrectly):
```
root@matrix-ml-1:~/ > cpupower frequency-info
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 2.20 GHz - 3.70 GHz
  available frequency steps: 3.70 GHz, 3.20 GHz, 2.20 GHz
  available cpufreq governors: ondemand performance schedutil
  current policy: frequency should be within 2.20 GHz and 3.70 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: 3.70 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: no
    Boost States: 0
    Total States: 3
    Pstate-P0: 3700MHz
    Pstate-P1: 3200MHz
    Pstate-P2: 2200MHz
```
Next, use these tools to find out the current state:

```
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

And:

```
watch -n 0.1 grep MHz /proc/cpuinfo
```
The end result is that you get a "consistent" CPU frequency, which should give you more consistent benchmarks.

I did notice inconsistent runs when benchmarking: sometimes 12 million ops/s on one run, then 8 million ops/s on another. Laptops in particular will do CPU throttling, slowing down the system when off power. See https://discourse.nixos.org/t/how-to-switch-cpu-governor-on-battery-power/8446/7
So benchmarks done on my desktop should have these settings applied. On the CI/CD we won't have control over these things anyway, so we won't know; the most reliable benchmarks should still be done on `matrix-ml-1` for consistency.
Relevant:
If boost is turned on, you'll get better results at times, and then when it hits the thermal limit you may get worse results later. It's just somewhat inconsistent.
This script, which I've written to support temporarily running something with the right CPU settings:

```bash
#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

if [[ "$@" == "" || "$@" == *-h* || "$@" == *--help* ]]; then
  cat<<EOF
fixed-cpu-run - Temporarily set CPU frequency governor to performance and disable CPU boost
Intended for reproducible benchmarks

Usage:
  fixed-cpu-run <command>
  fixed-cpu-run -h | --help

Options:
  -h --help  Show this help text.
EOF
  exit 64
fi

governor="$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"
boost="$(cat /sys/devices/system/cpu/cpufreq/boost)"

cleanup () {
  sudo sh << EOF
set -o errexit
set -o nounset
set -o pipefail
cpupower frequency-set -g "$governor" >/dev/null
echo "$boost" > /sys/devices/system/cpu/cpufreq/boost
EOF
}

trap cleanup EXIT

sudo sh << "EOF"
set -o errexit
set -o nounset
set -o pipefail
cpupower frequency-set -g performance >/dev/null
echo 0 > /sys/devices/system/cpu/cpufreq/boost
EOF

"$@"
```

With it we can do `fixed-cpu-run npm run bench`.
At the moment, we don't have any additional bot integration for test or benchmark reports. Both reports are just left on the system, and the GH bot does not report them to our PR. If we start using GitLab MRs, that would be a different situation, as they have native integration.
So for now, we will consider this to have been at least partially implemented for `js-logger`. Its benchmarking style can be propagated to the other projects when necessary.
For future work, we will need a GH bot that acquires the artifacts, and pushes them to be rendered as a comment on the PR, possibly rendering a markdown table and with graphics.
It's also important to be able to compare against previous runs on the CI/CD, so we could use https://docs.gitlab.com/ee/ci/yaml/#needsproject to refer to the same job on the previous pipeline of the same project, to get the previous metrics report. Then a diff of the 2 files can be done to show what has changed between the metrics. See https://github.github.com/gfm/ for more information on what can be displayed: Markdown tables, HTML tables.
Benny can also generate `'json' | 'csv' | 'table.html' | 'chart.html'`. Most likely CSV conversion to markdown will be the easiest to do.
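Such a conversion could be as simple as the following sketch (a hypothetical helper, assuming a plain CSV with a header row and no quoted commas, which is the easy case for benny's CSV output):

```typescript
// Convert a simple CSV string (header row + data rows) into a GFM table.
// Assumes no commas inside quoted fields.
function csvToMarkdown(csv: string): string {
  const rows = csv.trim().split('\n').map((line) => line.split(','));
  const [header, ...body] = rows;
  const fmt = (cells: string[]) => `| ${cells.join(' | ')} |`;
  return [
    fmt(header),
    fmt(header.map(() => '---')), // GFM delimiter row
    ...body.map(fmt),
  ].join('\n');
}
```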
I would also suggest outputting the benchmark results to `./tmp/X`, where `X` is some relevant job name. This means the metric reports can be gathered into one job to do the GH bot reporting, avoiding conflicts.
The same idea could be used for test reporting. https://github.com/marketplace/actions/publish-unit-test-results
Will leave this to a later date to pursue and solve together with unit test reporting.
The complexity of doing this sort of reporting would be onerous on each project. It's best to abstract pipeline configuration for all the projects so that they can inherit common jobs like this: #48. Scripts can be provided within the `nix-shell`, but that would require Nix provisioning of script data, which won't work on Windows or macOS for now. Perhaps one might distribute tools directly in the docker gitlab-runner image, like for example https://github.com/davidahouse/junit-to-md
Also, benchmark results would then be considered an artifact, not something to be committed to the repo. Although it'd be nice to have a persistent place to publish these benchmark results that still works for interactive usage. At the very least, CI/CD benchmarks would just be temporary/per-pipeline artifacts; I'm more talking about placing benchmark results produced interactively somewhere...
I've had to update the script to deal with Intel CPUs, which are slightly different. Changes located here: https://github.com/CMCDragonkai/.dotfiles-nixos/commit/fe44e07350a287613e25fd2c561a0cc04679574b Recommended if doing any benchmarking.
On the Vostro laptops you probably want to change to the `powersave` governor to save battery @emmacasolin @tegefaulkes; it uses variable frequency, so it will increase frequency when more load occurs. It depends on your use case, in case battery usage is an issue. See the `dell-vostro-5402` repo.
We should be using vega cli to generate static graphics or SVGs that can then be presented by the bot.
I've just updated the structure of Polykey's benchmarking, which looks like:

```
[nix-shell:~/Projects/Polykey]$ tree ./benches/
./benches/
├── index.ts
├── results
│   ├── git
│   │   ├── gitgc.chart.html
│   │   ├── gitgc.json
│   │   └── gitgc_metrics.txt
│   ├── keys
│   │   ├── key_generation.chart.html
│   │   ├── key_generation.json
│   │   ├── key_generation_metrics.txt
│   │   ├── random_bytes.chart.html
│   │   ├── random_bytes.json
│   │   ├── random_bytes_metrics.txt
│   │   ├── symmetric_crypto.chart.html
│   │   ├── symmetric_crypto.json
│   │   └── symmetric_crypto_metrics.txt
│   ├── metrics.txt
│   └── system.json
├── suites
│   ├── git
│   │   └── gitgc.ts
│   └── keys
│       ├── key_generation.ts
│       ├── random_bytes.ts
│       └── symmetric_crypto.ts
└── utils.ts

6 directories, 20 files
```
Note that `utils.ts` now provides some useful functions like `fsWalk`, and the ability to parse the file names.
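A `fsWalk` helper could look something like the sketch below; this is my guess at the shape of such a utility, not the actual Polykey implementation:

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Recursively collect all file paths under `dir`, depth-first.
function fsWalk(dir: string): string[] {
  const files: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      files.push(...fsWalk(full)); // recurse into subdirectories
    } else {
      files.push(full);
    }
  }
  return files;
}
```

Walking `./benches/suites` this way gives us the relative paths that the results directory mirrors.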
Basically, we get automatic mapping of the suite relative paths to the results relative paths. But the OpenMetrics format also limits us to using a dot separator (e.g. `keys.blah`) instead of `/` as a separator. So the metric names look like `keys.symmetric_crypto`, etc.
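The path-to-metric-name mapping is mechanical; a sketch of it (the function name is invented, and it assumes `.ts` suite files):

```typescript
// Map a suite's relative path (e.g. "keys/symmetric_crypto.ts") to a
// dot-separated metric name (e.g. "keys.symmetric_crypto").
function suitePathToMetricName(relPath: string): string {
  return relPath
    .replace(/\.ts$/, '')   // drop the extension
    .split(/[\\/]/)         // split on posix or windows separators
    .join('.');             // OpenMetrics-friendly dot separator
}
```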
In the future, if we change our names, we have to be aware that this breaks any long-term viewing.
We can also make use of `git log` to find metric changes and load those files too. Will need to investigate later how to use the `git` tool to bring in a specific file from a previous commit.
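Retrieving a file from a previous commit is what `git show <rev>:<path>` does. Demonstrated below on a throwaway repo; in CI it would be something like `git show HEAD~1:benches/results/metrics.txt > previous_metrics.txt` (paths here are illustrative):

```shell
set -o errexit
# Build a throwaway repo with two versions of a metrics file
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.email bench@example.com
git config user.name bench
echo 'suitename_ops{name="benchmarkname"} 100' > metrics.txt
git add metrics.txt && git commit --quiet -m 'first'
echo 'suitename_ops{name="benchmarkname"} 120' > metrics.txt
git add metrics.txt && git commit --quiet -m 'second'
# Bring the previous commit's file into the working tree
git show HEAD~1:metrics.txt > previous_metrics.txt
# Diff the two reports (diff exits 1 on differences, hence || true)
diff previous_metrics.txt metrics.txt || true
```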
Specification
Now that gitlab supports metrics reports in its CI/CD (https://docs.gitlab.com/ee/ci/metrics_reports.html), we should be able to also integrate our benchmarking suite into this.
This should then replicate this github action: https://github.com/marketplace/actions/continuous-benchmark.
And then show us the benchmark with our MatrixAI-Bot on the PR.
The benchmarks would re-run whenever we submit a commit, but we could make use of "incremental computation" so that benchmarks only re-run if code that affects them has changed. This is more complicated: how does one track whether a script needs to run, unless one can track all of its dependencies?
Alternatively, we re-run them on specific "commit" tags. Like the way we use `[ci skip]`, we could do `[ci bench]` to trigger the bench. Similar issues occur with submitting commits and re-running only the tests that are affected by those changes: https://suncommander.medium.com/run-jest-for-unit-tests-of-modified-files-only-e39b7b176b1b
Benchmarking is useful for finding performance regressions. Our benchmarking suite would need to generate a file compatible with "open metrics".
Furthermore, profile reports are useful to find out about GC, memory pressure and CPU usage. The only issue is that it depends on what exactly we are profiling. For a sequential script that's possible, but for an application with multiple endpoints, each endpoint execution may have its own relevant profile. Therefore profiles are ultimately something that could be done along with benchmarks, since for each benchmark we can also generate a profiling report to analyse. We can get Node.js to generate a profiling report from V8, and then use https://github.com/jlfwong/speedscope to visualise it. The CI/CD system can push "pages" to GitHub Pages that update this information, and one could visit the repo's speedscope profile reports for any given benchmark.
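For example, assuming a recent Node.js (>= 12), the built-in `--cpu-prof` flag produces a V8 `.cpuprofile` file that speedscope can open directly. The directory and the inline workload below are just placeholders:

```shell
# Generate a V8 CPU profile; --cpu-prof-dir controls where it lands.
mkdir -p ./tmp/profiles
node --cpu-prof --cpu-prof-dir=./tmp/profiles -e '
  // stand-in for a benchmark workload
  let acc = 0;
  for (let i = 0; i < 1e6; i++) acc += Math.sqrt(i);
'
ls ./tmp/profiles
# The resulting CPU.*.cpuprofile can be loaded at https://www.speedscope.app
# or with: npx speedscope ./tmp/profiles/CPU.*.cpuprofile
```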
This should be done for any benchmarking reports too. We should make use of GitHub Pages more, since right now GitHub will auto-generate the pages deployment upon changes to the `docs` directory under the `master` branch. All that would need to be done is to update assets in the `docs` directory, or to update "artifacts" that URLs in the `docs` directory point to.

Additional context
Tasks