Closed: balsoft closed this issue 1 week ago
Here are my preliminary notes on our desired CI system. @mkaito, could you massage this into some human-readable text?
These depend on both the source and the target.
The current CI fulfils its functional requirements poorly because of its low implementation quality.
CI runs on 180 Buildkite agents on GKE, split over 215 c2-standard-16 VMs on Google Cloud. Autoscaling is enabled but not working properly. There are two build queues, default and integration; the integration tests run on beefier machines.
CI runs a lot of builds, mostly using different Docker images. Deployment jobs are problematic because of VM preemptions.
Coherence is maintained through shared artifact storage and deployment targets.
Build runners (and integration test nodes) are often killed by GCP preemption or Kubernetes, causing a build failure that has to be retried.
The entire build is done for every CI run, even if nothing changed (an attempt is made to mitigate this with 'monorepo triage').
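The 'monorepo triage' idea boils down to a path-based check: only build a component when files under its directory changed relative to the base branch. A minimal sketch, using a self-contained throwaway repo; the layout (`src/app/cli`, `docs`) and branch names are hypothetical stand-ins:

```shell
set -eu
# Build a tiny demo repo so the snippet runs anywhere.
repo=$(mktemp -d); cd "$repo"; git init -q
mkdir -p src/app/cli docs
echo code > src/app/cli/main.ml
git add .
git -c user.email=ci@example.com -c user.name=ci commit -qm base
git branch develop
# A later commit touches only the docs, not the CLI sources.
echo notes > docs/readme.md
git add docs
git -c user.email=ci@example.com -c user.name=ci commit -qm docs-only

# Triage: build the CLI only if files under its directory changed vs. develop.
if git diff --quiet develop HEAD -- src/app/cli/; then
  result=skip
else
  result=build
fi
echo "cli: $result"
```

Since the second commit only touches `docs/`, the CLI build is skipped.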
A lot of the CI steps run the same build process; sharing these steps would save money.
Currently, all builds can access all credentials, and the credentials are very powerful. This is mitigated with the 'ci-build-me' label on GitHub, which is necessary to run CI. However, token leaks are still an issue.
Some tests, like "merges cleanly into develop", depend on the current state of 'unrelated' branches. This is a problem for CI predictability and caching.
Require commit signatures. Disallow PRs into any branch other than develop/compatible/master/release.
A minimum viable CI runs 4-6 Buildkite nodes on a non-preemptible c2-standard-30 VM (30 vCPUs, 120 GB memory); we can expand this later. There should be two build queues, a default one and a deployment one. The deployment queue should have a single agent (to prevent interference) and the security credentials needed to do deployments.
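As a sketch of the two-queue split, a Buildkite pipeline fragment can route steps via queue tags and serialise deployments with a concurrency group; `release.nix` and `deploy.sh` are hypothetical placeholders:

```shell
# Pipeline fragment kept in a variable; in practice this would be fed to
# `buildkite-agent pipeline upload`.
pipeline=$(cat <<'EOF'
steps:
  - label: "build"
    command: "nix-build release.nix"   # runs on the general-purpose agents
    agents:
      queue: default
  - wait
  - label: "deploy"
    command: "./deploy.sh"             # only the credentialed deploy agent picks this up
    agents:
      queue: deploy
    concurrency: 1
    concurrency_group: "deploy"
EOF
)
echo "$pipeline"
```

The single deploy agent plus `concurrency: 1` means deployments never run in parallel, even if the queue later gains more agents.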
Mostly Nix builds, which are deterministic and offer caching; we can also create Docker images this way. Deployment jobs should run on a different build queue with a single agent and security credentials, so they don't interfere. Differential jobs (i.e. branching constraints) should also specify a clear dependency and depend only on the current state of the branch (corresponding develop/compatible revisions should be found using git merge-base, etc.).
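Pinning the comparison revision with `git merge-base` could look like this; a self-contained toy repo stands in for the real one, and the branch names are illustrative:

```shell
set -eu
# Tiny demo repo: develop, plus a feature branch one commit ahead of it.
repo=$(mktemp -d); cd "$repo"; git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m base
git branch develop
git checkout -qb feature
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m feature-work

# Record the merge-base once at the start of the build; all differential
# checks use this pinned revision, so later pushes to develop cannot
# change the result of an in-flight build.
base=$(git merge-base develop feature)
echo "differential checks run against $base"
```

Here the pinned revision is the tip of develop at branch-off time, which is exactly what a reproducible "merges cleanly into develop" check should compare against.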
We can use a shared Nix cache, but scaling beyond a single machine may involve build orchestration by transforming the Nix dependency graph into a Buildkite pipeline on demand. https://github.com/serokell/common-infra
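The graph-to-pipeline transformation might look roughly like this: one Buildkite step per derivation, so independent store paths can build on separate agents. The `.drv` names below are fake stand-ins for what `nix-store --query --requisites` would emit for the real derivation:

```shell
# Hypothetical dependency list; in the real system this would come from
# querying the instantiated top-level derivation.
deps="mina-libs.drv
mina-daemon.drv"

# Emit one pipeline step per derivation.
pipeline="steps:"
for d in $deps; do
  pipeline="$pipeline
  - label: \"build $d\"
    command: \"nix-store --realise /nix/store/$d\"
    agents:
      queue: default"
done
echo "$pipeline"
```

A real implementation would also encode the graph's edges as step dependencies (or `wait` barriers) so a derivation only builds after its inputs, which is the hard part this sketch leaves out.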
@yorickvP It's great to have these notes here to prompt some thought by others. Thank you. I propose that the next action is to gather as a team (those who would work on things related to Nix CI) and construct a set of GitHub issues that are a breakdown of the work into efforts that would sum to give the vision that you collectively develop for this (Epic) work.
The vision shared in this issue, @michal0mina, need not necessarily be shared by you. It may be that this issue is not necessary to complete the "Nix-Enabled CI" project. At your discretion.
Nix-based build, test and deployment system for Mina.