faster-cpython / ideas


Implement benchmarking infrastructure in terms of GitHub self-hosted runners #506

Closed · mdboom closed this issue 1 year ago

mdboom commented 1 year ago

While the existing benchmarking portal infrastructure has served us well, if there were an existing framework for kicking off benchmarks that met our needs, we could save on maintenance costs and possibly take advantage of more "batteries included". The biggest shortcoming of the current code is the lack of Windows support, so it makes sense to explore other options before embarking on that large amount of work. This issue explores whether GitHub Actions can replace the existing portal.

(Ticked checkboxes indicate confirmed possible, not completed).
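To make this concrete, here is a minimal sketch of the kind of workflow this could enable. Everything in it (workflow name, runner labels, build and upload steps) is illustrative, not a finished implementation:

```yaml
# Hypothetical sketch: a manually-triggered benchmark run on a
# self-hosted runner. All names and labels are illustrative.
name: benchmark

on:
  workflow_dispatch:
    inputs:
      fork:
        description: "GitHub fork of CPython to benchmark"
        default: "python"
      ref:
        description: "Branch, tag, or commit to benchmark"
        default: "main"

jobs:
  benchmark:
    # Assumes a self-hosted runner registered with these labels.
    runs-on: [self-hosted, linux, benchmark]
    steps:
      - uses: actions/checkout@v4
        with:
          repository: ${{ github.event.inputs.fork }}/cpython
          ref: ${{ github.event.inputs.ref }}
      - name: Build CPython with optimizations
        run: |
          ./configure --enable-optimizations --with-lto
          make -j "$(nproc)"
      - name: Run pyperformance
        run: |
          python3 -m pip install --user pyperformance
          # ./python is the freshly-built interpreter on Linux.
          python3 -m pyperformance run --python=./python -o results.json
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.json
```

A run kicked off this way is effectively fire-and-forget: the runner picks up the job, and the results land as artifacts.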

itamaro commented 1 year ago

hey @mdboom, I'm also interested in benchmarking infra "as a service"! specifically, we (at Meta) use a bare metal machine (c5n.metal) in AWS for stable pyperformance runs. this is a very manual and somewhat tedious process, with a lot of room for inconsistencies (e.g. setting up all the cpu isolation and turbo settings and whatnot) and human errors (not comparing with the correct base rev, not building with all the optimizations).

we've been wanting to automate the entire thing, roughly with these requirements:

  1. client-side script for submitting "benchmarking requests" (runnable from Linux & Mac), ideally taking only a pointer to "what to benchmark" (e.g. GitHub fork & branch/tag/commit) and an optional "what to compare to" pointer, acting as "fire and forget"
  2. AWS-side "controller" (e.g. a bunch of Lambda scripts + DB) to record benchmarking requests and serve as an endpoint to keep track of progress and get a link to the results
  3. AWS-side controller orchestrates "fulfilling" benchmarking requests by spinning up instances on-demand and taking care of the entire tune machine-build-benchmark-upload results workflow
  4. web UI to visualize benchmarking sessions (track progress, display results)

all the scripts and devops chops can be fully open source and reusable by anyone who can provide their own AWS account & credentials, but indeed opening it up as a "community service" would be problematic.

with the GitHub Actions workflow you propose here + using AWS as self-hosted-runner, do you think what I described above can be achieved? why do you care about running benchmarks on Windows?

carljm commented 1 year ago

> why do you care about running benchmarks on Windows?

perf characteristics can be quite different on Windows for a number of reasons, including compiler differences and differing perf characteristics of syscalls; it seems reasonable to have visibility into this for a major supported platform (probably the most widely-used one)

mdboom commented 1 year ago

Thanks for your interest, @itamaro.

Probably some context on our current status quo would be helpful. We currently have a couple of "benchmarking" machines (Linux and Mac), which are literal single-purpose machines. We could probably use bare metal offerings from a cloud provider, but we just have these machines sitting around, and our team is small enough that there's very rarely contention for them. These machines don't open ports directly on the internet -- we have a custom "portal" (that just runs on a tiny container in the cloud) that users use to fire off benchmarks etc. The portal handles submitting jobs to the benchmarking machines, making sure the machines only do one thing at a time, and a bunch of queue management and publishing of results. There's no web UI.

This issue was created due to the realization that there's a lot of overlap between the tasks the portal does and what GitHub Actions (plus the community of available actions) provides, such that maybe maintaining it ourselves no longer makes sense. There are a bunch of other workflow management systems we could use (e.g. Apache Airflow), but GitHub Actions seems uniquely close to the requirements.

So, yes, I think GitHub Actions + cloud bare metal is likely to meet the requirements you described (though I have no experience with bare metal offerings). I fully intend to share as much as we can of how we build this, and I'm happy to collaborate on it where possible. As you say, actually sharing the infra itself gets tricky, but I'd like to explore that at some point as a follow-on. (The Python community buildbots are a sort of success story there that maybe we could build on.)
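For what it's worth, your requirement 1 seems to map fairly directly onto workflow_dispatch inputs. A hypothetical trigger section (the names are illustrative) might look like:

```yaml
# Hypothetical trigger: "what to benchmark" plus an optional
# "what to compare to" pointer, per requirement 1 above.
on:
  workflow_dispatch:
    inputs:
      fork:
        description: "GitHub fork to benchmark"
        default: "python"
      ref:
        description: "Branch, tag, or commit to benchmark"
        default: "main"
      compare_to:
        description: "Optional base rev to compare against"
        required: false
```

Submission is then fire-and-forget from Linux or Mac via the GitHub CLI (`gh workflow run`) or the REST API, with GitHub itself acting as the controller, queue, and progress UI (requirements 2 and 4).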

I agree with @carljm about Windows. There have been some surprising regressions on Windows in the past, mainly due to different compiler behavior. Ideally, we'd like to cover the {MSVC, gcc, clang} × {Linux, Mac, Windows} × {x86_64, ARM64} matrix.
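Covering that matrix is one place GitHub Actions helps: the platforms could be expressed as a job matrix over self-hosted runner labels. A hypothetical fragment (the labels are made up, and the trigger and real steps are elided):

```yaml
# Hypothetical jobs section: one benchmark job fanned out across
# self-hosted runners; the runner labels are illustrative.
jobs:
  benchmark:
    strategy:
      fail-fast: false  # one platform failing shouldn't cancel the rest
      matrix:
        runner:
          - benchmark-linux-x86_64
          - benchmark-windows-x86_64
          - benchmark-macos-arm64
    runs-on: ${{ matrix.runner }}
    steps:
      - run: echo "build + benchmark steps as in the earlier sketch"
```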

mdboom commented 1 year ago

I have a proof-of-concept working in a private repo for this. The usage is basically going to the "Actions" tab for the repo and clicking the "Run workflow" button:

[screenshot: the "Run workflow" form on the repo's Actions tab]

The results appear as artifacts (downloadable from each workflow's run page), and are also uploaded to the same private repo as files. To start running comparisons between results, you just check out the repo locally and run `pyperf compare_to`.

This same workflow can be used for weekly cronjobs.
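Presumably that just means adding a schedule trigger alongside the manual one; something like (the cron line is illustrative):

```yaml
# Hypothetical: the same workflow, also triggered weekly.
on:
  workflow_dispatch:
  schedule:
    - cron: "0 4 * * 0"  # Sundays at 04:00 UTC
```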

Reading through the documentation and code for the portal again, I think the only missing functionality is uploading the weekly results to the ideas repo and producing the summary tables that compare against the Python 3.10 baseline. That could be pretty easily extracted from the existing portal source code.

mdboom commented 1 year ago

For those interested in the complexity of this, I've put the code up here. This public repo isn't connected to any self-hosted runners, so you can't actually see/run the jobs.

bhack commented 1 year ago

https://learn.microsoft.com/it-it/samples/azure-samples/github-runner-on-aks/self-hosted-github-actions-runner-on-aks-azure-kubernetes-service-with-auto-scale-option/

mdboom commented 1 year ago

For those who are interested, here is the code for what we are using in production: https://github.com/faster-cpython/benchmarking-public

(The actual repo with the self-hosted runners is private for security reasons, but the code itself is public so others can learn from and collaborate on it.)