Implement benchmarking infrastructure in terms of GitHub self-hosted runners

mdboom commented 1 year ago

While the existing benchmarking portal infrastructure has served us well, if there were an existing framework to kick off benchmarks that served our needs, we could save on maintenance costs, and possibly take advantage of more "batteries included". The biggest shortcoming of the current code is the lack of Windows support, so it makes sense to explore other options before embarking on that large amount of work. This issue explores the capabilities of using GitHub Actions to replace the existing portal.

(Ticked checkboxes indicate confirmed possible, not completed).

[X] GitHub Actions supports self-hosted runners that are fully under our control, so we can configure them to be benchmarking compatible
[x] GitHub self-hosted runners use polling to assign work, so they do not need to have open ports -- they just need access to github.com and some other public websites. This should work on MS lab machines.
[X] The Actions should be set up on a private repo, since none of the usual protections for GitHub Actions (a fresh VM each time, etc.) are available. This probably needs to be MS FTE/contractors only. Practically, this is no worse than what we have now, but to be clear, this doesn't solve the "benchmarking as a community resource" problem.
[x] The self-hosted runner can ensure that only one benchmarking run happens at the same time
[x] Cross-platform recipes for installing the base Python (not the benchmarked Python)
[X] Publishing results to a GitHub repository is simple (though publishing to a public one may require some token acrobatics)
[X] Results would be published as downloadable "artifacts" available through the GitHub UI by default
[X] Ability to run "cron jobs" using periodic actions
[X] Actions can be run from the commandline using the gh command, and we could provide a convenience script around this
[X] Visibility into what jobs are currently running, seeing the output in realtime, cancelling jobs etc. is all available on the GitHub website, and some functionality in the gh CLI as well.
[x] The custom setup of pyperformance, especially to run pyperformance and pyston jobs, may be a little fiddly, but should be possible.
[x] Does pyperformance compile support Windows? (Looks like maybe not...) EDIT: It doesn't, but it's easier to just do this as part of GitHub Actions.
[x] Are there other features in the portal we would miss?

itamaro commented 1 year ago

hey @mdboom, I'm also interested in benchmarking infra "as a service"! specifically, we (at Meta) use a bare metal machine (c5n.metal) in AWS for stable pyperformance runs. this is a very manual and somewhat tedious process, with a lot of room for inconsistencies (e.g. setting up all the cpu isolation and turbo settings and whatnot) and human errors (not comparing with the correct base rev, not building with all the optimizations).

we've been wanting to automate the entire thing, roughly with these requirements:

client-side script for submitting "benchmarking requests" (runnable from Linux & Mac), ideally taking only a pointer to "what to benchmark" (e.g. GitHub fork & branch/tag/commit) (optional "what to compare to" pointer), acting as "fire and forget"
AWS-side "controller" (e.g. a bunch of Lambda scripts + DB) to record benchmarking requests and serve as an endpoint to keep track of progress and get a link to the results
AWS-side controller orchestrates "fulfilling" benchmarking requests by spinning up instances on-demand and taking care of the entire tune machine-build-benchmark-upload results workflow
web UI to visualize benchmarking sessions (track progress, display results)

all the scripts and devop-chops can be fully open source and reusable by anyone as long as they can provide their own AWS account & credentials, but indeed opening up as a "community service" would be problematic.

with the GitHub Actions workflow you propose here + using AWS as self-hosted-runner, do you think what I described above can be achieved? why do you care about running benchmarks on Windows?

to clarify, when I say "we at Meta" I'm referring to open source cpython changes we're making, not Cinder :) Cinder perf testing is done in a totally different way based on production Instagram and a lot of internal infra

carljm commented 1 year ago

why do you care about running benchmarks on Windows?

perf characteristics can be quite different on Windows for a number of reasons, including compiler differences and differing perf characteristics of syscalls, seems reasonable to have visibility into this for a major supported platform (probably the most widely-used one)

mdboom commented 1 year ago

Thanks for your interest, @itamaro.

Probably some context on our current status quo would be helpful. We currently have a couple of "benchmarking" machines (Linux and Mac), which are literal single-purpose machines. We could probably use bare metal offerings from a cloud provider, but, we just have these machines sitting around, and our team is small enough that there's very rarely contention for them. These machines don't open ports directly on the internet -- we have a custom "portal" (that just runs on a tiny container in the cloud) that users use to fire off benchmarks etc. The portal handles submitting jobs to the benchmarking machines, making sure the machines only do one thing at a time, and a bunch of queue management and publishing of results. There's no Web UI. This issue was created due to the realization that there's a lot of overlap between the tasks that the portal does and what GitHub Actions (+ the community of available actions) such that maybe maintaining that ourselves no longer makes sense. There's a bunch of other workflow management systems we could use (e.g. Apache Airflow), but GitHub Actions seems uniquely close to the requirements.

So, yes, I think GitHub Actions + cloud bare metal is likely to meet the requirements you described (but I have no experience with bare metal offerings). I fully intend to share what we can the "how" we build this, and happy to collaborate on it where possible. As you say, actually sharing the infra itself gets tricky, but I'd like to explore that at some point as a follow-on. (The Python community buildbots are a sort of success story there that maybe we could build on).

I agree with @carljm about Windows. There have been some surprising regressions on Windows in the past, mainly due to different compiler behavior. Ideally, we'd like to cover the {MSVC/gcc/clang}-{Linux/Mac/Windows}-{x86_64/ARM64} matrix.

mdboom commented 1 year ago

I have a proof-of-concept working in a private repo for this. The usage is basically going to the "Actions" tab for the repo and clicking the "Run workflow" button:

The results appear as artifacts (downloadable from each workflow's run page), and are also uploaded to the same private repo as files. To start running comparisons between results, you just check out the repo locally and run pyperf compare_to.

This same workflow can be used for weekly cronjobs.

Reading through the documentation and code for the portal again, I think the only missing functionality is uploading the weekly results to the ideas repo and producing the summary tables that compare against the Python 3.10 baseline. That could be pretty easily extracted from the existing portal source code.

mdboom commented 1 year ago

For those interested in the complexity of this, I've put the code up here. This public repo isn't connected to any self-hosted runners, so you can't actually see/run the jobs.

bhack commented 1 year ago

https://learn.microsoft.com/it-it/samples/azure-samples/github-runner-on-aks/self-hosted-github-actions-runner-on-aks-azure-kubernetes-service-with-auto-scale-option/

mdboom commented 1 year ago

For those who are interested, here is the code for what we are using in production: https://github.com/faster-cpython/benchmarking-public

(The actual repo with the self-hosted runners is private for security reasons, but the code itself is public so others can learn from and collaborate on it.)

faster-cpython / ideas

Implement benchmarking infrastructure in terms of GitHub self-hosted runners #506