dshemetov opened this issue 1 year ago
@dshemetov I used this temporary version of the workflow https://github.com/cmu-delphi/delphi-epidata/pull/1253 to test its timing, running the same set of ~1000 queries used in most previous tests every 30 minutes for 24 hours. Here's how the results look:
If the two outliers (a run that failed right away due to a Docker website 503 and a run that timed out) are removed:
This is pretty substantial variance, so it looks like moving this workflow to a self-hosted runner could be the way to go.
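For reference, here's a rough sketch of how the run-level timings can be pulled back out of the GitHub REST API (the workflow file name below is a placeholder, not necessarily the one in the PR):

```python
# Sketch: fetch recent runs of the perf-test workflow and compute wall-clock
# durations. The repo is real; the workflow file name is an assumed placeholder.
from datetime import datetime
import os
import requests

REPO = "cmu-delphi/delphi-epidata"
WORKFLOW = "performance-tests.yml"  # placeholder name
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def parse(ts: str) -> datetime:
    # GitHub returns ISO 8601 timestamps like "2023-09-01T12:00:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW}/runs",
    headers=HEADERS,
    params={"per_page": 100},
)
resp.raise_for_status()

run_durations = [
    (parse(run["updated_at"]) - parse(run["run_started_at"])).total_seconds()
    for run in resp.json()["workflow_runs"]
    if run["conclusion"] == "success"
]
print(sorted(run_durations))
```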
Thanks @rzats! That is quite a bit of variance. Hopefully something self-hosted would be more stable. Also hoping that we can set jobs on a self-hosted runner to queue instead of colliding (for example, if two people run perf tests on two separate branches at the same time).
Can you give us a list of the running times? It'd be nice to see a proper calculation of mean/stdev/variance, and maybe a plot of the distribution.
Also, these times are for the entirety of the process (clone + build + run), right? Is there more or less variability if you isolate just the Locust running time?
@melange396 with some slightly fancier API calls I managed to get the same chart for the Locust step of the runs only. It looks essentially identical to the previous chart, so that step is where most of the variance comes from.
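The "fancier API calls" amount to hitting the jobs endpoint for each run, which exposes per-step timestamps. A sketch, continuing the earlier one (same `HEADERS` and `parse` helper; matching on "locust" in the step name is an assumption about how the step is labeled):

```python
# Sketch: time just the Locust step of a given workflow run via the jobs API.
from typing import Optional

def locust_step_seconds(run_id: int, repo: str = "cmu-delphi/delphi-epidata") -> Optional[float]:
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/actions/runs/{run_id}/jobs",
        headers=HEADERS,
    )
    resp.raise_for_status()
    for job in resp.json()["jobs"]:
        for step in job["steps"]:
            if "locust" in step["name"].lower() and step["completed_at"]:
                return (
                    parse(step["completed_at"]) - parse(step["started_at"])
                ).total_seconds()
    return None  # no completed Locust step found in this run
```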
And here's a Seaborn distribution plot for that set of runs; there were only ~48 of them, so the distribution doesn't look quite Gaussian :)
The mean runtime was ~405.3 s, with a std of ~48.3 s and variance of 2330 "square seconds".
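Roughly how those numbers and the plot can be reproduced (a sketch; `locust_durations` stands in for the list of per-run Locust step times, in seconds, collected as in the earlier sketch):

```python
# Summary stats and distribution plot for the per-run Locust step times.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

durations = np.array(locust_durations)  # ~48 per-run values, in seconds

print(f"mean = {durations.mean():.1f} s")
print(f"std  = {durations.std(ddof=1):.1f} s")
print(f"var  = {durations.var(ddof=1):.0f} s^2")

sns.histplot(durations, kde=True)
plt.xlabel("Locust step runtime (s)")
plt.ylabel("number of runs")
plt.show()
```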
@dshemetov @melange396 the latest test run (with the API keys being applied and checked) looks a bit more stable - it's pretty much a Gaussian distribution around a ~330s median runtime for the Locust step.
Are these results from our self-hosted rig or from GH machines?
@dshemetov at this point we're using the self-hosted runner.
The new Locust CI system by @rzats is very cool and convenient! In order to increase our trust in the system, we should get some baseline numbers about how it works.
My main question is: what sort of variance should we expect in the benchmark numbers from run to run?
There are many factors that could contribute to this variance.
Without worrying about any of these factors specifically, but instead approaching the system as a whole, I propose that we run the benchmarks repeatedly, over both a long window and a short one.
Running the benchmarks over a long time will help us see how much GH load affects us. Running the benchmarks over a short time will help us see how much variance there is even with the same GH load.
EDIT: h/t @rzats, who found this link on GH Actions performance stability.