dshemetov opened this issue 1 year ago
@dshemetov I used this temporary version of the workflow https://github.com/cmu-delphi/delphi-epidata/pull/1253 to test its timing, running the same set of ~1000 queries used in most previous tests every 30 minutes for 24 hours. Here's how the results look:
If the two outliers (a run that failed right away due to a Docker website 503 and a run that timed out) are removed:
This is pretty substantial variance, so it looks like moving this workflow to a self-hosted runner could be the way to go.
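For reference, here's a rough sketch of how the run-level timings can be pulled back out of the GitHub REST API (the workflow file name below is a placeholder, not necessarily the one in the PR):

```python
# Sketch: fetch recent runs of the perf-test workflow and compute wall-clock
# durations. The repo is real; the workflow file name is an assumed placeholder.
from datetime import datetime
import os
import requests

REPO = "cmu-delphi/delphi-epidata"
WORKFLOW = "performance-tests.yml"  # placeholder name
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def parse(ts: str) -> datetime:
    # GitHub returns ISO 8601 timestamps like "2023-09-01T12:00:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW}/runs",
    headers=HEADERS,
    params={"per_page": 100},
)
resp.raise_for_status()

run_durations = [
    (parse(run["updated_at"]) - parse(run["run_started_at"])).total_seconds()
    for run in resp.json()["workflow_runs"]
    if run["conclusion"] == "success"
]
print(sorted(run_durations))
```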
Thanks @rzats! That is quite a bit of variance. Hopefully something self-hosted would be more stable. Also hoping that we can set jobs on a self-hosted runner to queue instead of colliding (for example, if two people run perf tests on two separate branches at the same time).
Can you give us a list of the running times? It'd be nice to see a proper calculation of mean/stdev/variance, and maybe a plot of the distribution.
Also, these times are for the entirety of the process (clone + build + run), right? Is there more or less variability if you isolate just the Locust running time?
@melange396 with some slightly fancier API calls I managed to get the same chart for the Locust step of the runs only. It looks essentially identical to the previous chart, so that step is where most of the variance comes from.
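The "fancier API calls" amount to hitting the jobs endpoint for each run, which exposes per-step timestamps. A sketch, continuing the earlier one (same `HEADERS` and `parse` helper; matching on "locust" in the step name is an assumption about how the step is labeled):

```python
# Sketch: time just the Locust step of a given workflow run via the jobs API.
from typing import Optional

def locust_step_seconds(run_id: int, repo: str = "cmu-delphi/delphi-epidata") -> Optional[float]:
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/actions/runs/{run_id}/jobs",
        headers=HEADERS,
    )
    resp.raise_for_status()
    for job in resp.json()["jobs"]:
        for step in job["steps"]:
            if "locust" in step["name"].lower() and step["completed_at"]:
                return (
                    parse(step["completed_at"]) - parse(step["started_at"])
                ).total_seconds()
    return None  # no completed Locust step found in this run
```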
And here's a Seaborn distribution plot for that set of runs; there were only ~48 of them, so the distribution doesn't look quite Gaussian :)
The mean runtime was ~405.3 s, with a std of ~48.3 s and variance of 2330 "square seconds".
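Roughly how those numbers and the plot can be reproduced (a sketch; `locust_durations` stands in for the list of per-run Locust step times, in seconds, collected as in the earlier sketch):

```python
# Summary stats and distribution plot for the per-run Locust step times.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

durations = np.array(locust_durations)  # ~48 per-run values, in seconds

print(f"mean = {durations.mean():.1f} s")
print(f"std  = {durations.std(ddof=1):.1f} s")
print(f"var  = {durations.var(ddof=1):.0f} s^2")

sns.histplot(durations, kde=True)
plt.xlabel("Locust step runtime (s)")
plt.ylabel("number of runs")
plt.show()
```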
@dshemetov @melange396 the latest test run (with the API keys being applied and checked) looks a bit more stable - it's pretty much a Gaussian distribution around a ~330s median runtime for the Locust step.
Are these results from our self-hosted rig or from GH machines?
@dshemetov at this point we're using the self-hosted runner.
The new Locust CI system by @rzats is very cool and convenient! In order to increase our trust in the system, we should get some baseline numbers about how it works.
My main question is: what sort of variance should we expect in the benchmark numbers from run to run?
There are many factors that could contribute to this variance.
Without worrying about any of these factors specifically, but instead approaching the system as a whole, I propose that we run the benchmarks repeatedly, over both a long window and a short one.
Running the benchmarks over a long time will help us see how much GH load affects us. Running the benchmarks over a short time will help us see how much variance there is even with the same GH load.
EDIT: h/t @rzats, who found this link on GH Actions performance stability.