Closed: balsoft closed this issue 1 week ago
Here are my preliminary notes on our desired CI system. @mkaito, could you massage this into some human-readable text?
These depend on both the source and the target.
The current CI fulfils its functional requirements poorly because of its low implementation quality.
CI runs on 180 Buildkite agents on GKE, split over 215 c2-standard-16 VMs on Google Cloud. Autoscaling is enabled but not working properly. There are two build queues, default and integration; the integration tests run on beefier machines.
CI runs a lot of builds, mostly using different Docker images. Deployment jobs are problematic because of VM preemptions.
Coherence is maintained through shared artifact storage and deployment targets.
Build runners (and integration test nodes) are often killed by GCP preemption or Kubernetes, causing a build failure that has to be retried.
The entire build is done for every CI run, even if nothing changed (an attempt is made to mitigate this with 'monorepo triage').
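The 'monorepo triage' idea boils down to a path-based check: only build a component when files under its directory changed relative to the base branch. A minimal sketch, using a self-contained throwaway repo; the layout (`src/app/cli`, `docs`) and branch names are hypothetical stand-ins:

```shell
set -eu
# Build a tiny demo repo so the snippet runs anywhere.
repo=$(mktemp -d); cd "$repo"; git init -q
mkdir -p src/app/cli docs
echo code > src/app/cli/main.ml
git add .
git -c user.email=ci@example.com -c user.name=ci commit -qm base
git branch develop
# A later commit touches only the docs, not the CLI sources.
echo notes > docs/readme.md
git add docs
git -c user.email=ci@example.com -c user.name=ci commit -qm docs-only

# Triage: build the CLI only if files under its directory changed vs. develop.
if git diff --quiet develop HEAD -- src/app/cli/; then
  result=skip
else
  result=build
fi
echo "cli: $result"
```

Since the second commit only touches `docs/`, the CLI build is skipped.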
A lot of the CI steps run the same build process; sharing these steps would save money.
Currently, all builds can access all credentials, and the credentials are very powerful. This is mitigated with the 'ci-build-me' label on GitHub, which is necessary to run CI. However, token leaks are still an issue.
Some tests, like "merges cleanly into develop", depend on the current state of 'unrelated' branches. This is a problem for CI predictability and caching.
Require commit signatures. Disallow PRs into any branch other than develop/compatible/master/release.
A minimum viable CI runs 4-6 Buildkite nodes on a non-preemptible c2-standard-30 VM (30 vCPUs, 120 GB memory); we can expand this later. There should be two build queues, a default one and a deployment one. The deployment queue should have a single agent (to prevent interference) and the security credentials needed to do deployments.
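As a sketch of the two-queue split, a Buildkite pipeline fragment can route steps via queue tags and serialise deployments with a concurrency group; `release.nix` and `deploy.sh` are hypothetical placeholders:

```shell
# Pipeline fragment kept in a variable; in practice this would be fed to
# `buildkite-agent pipeline upload`.
pipeline=$(cat <<'EOF'
steps:
  - label: "build"
    command: "nix-build release.nix"   # runs on the general-purpose agents
    agents:
      queue: default
  - wait
  - label: "deploy"
    command: "./deploy.sh"             # only the credentialed deploy agent picks this up
    agents:
      queue: deploy
    concurrency: 1
    concurrency_group: "deploy"
EOF
)
echo "$pipeline"
```

The single deploy agent plus `concurrency: 1` means deployments never run in parallel, even if the queue later gains more agents.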
Mostly Nix builds, which are deterministic and offer caching; we can also create Docker images this way. Deployment jobs should run on a different build queue with a single agent and security credentials, so they don't interfere. Differential jobs (i.e. branching constraints) should also specify a clear dependency and depend only on the current state of the branch (corresponding develop/compatible revisions should be found using git merge-base, etc.).
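Pinning the comparison revision with `git merge-base` could look like this; a self-contained toy repo stands in for the real one, and the branch names are illustrative:

```shell
set -eu
# Tiny demo repo: develop, plus a feature branch one commit ahead of it.
repo=$(mktemp -d); cd "$repo"; git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m base
git branch develop
git checkout -qb feature
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m feature-work

# Record the merge-base once at the start of the build; all differential
# checks use this pinned revision, so later pushes to develop cannot
# change the result of an in-flight build.
base=$(git merge-base develop feature)
echo "differential checks run against $base"
```

Here the pinned revision is the tip of develop at branch-off time, which is exactly what a reproducible "merges cleanly into develop" check should compare against.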
We can use a shared Nix cache, but scaling beyond a single machine may involve build orchestration by transforming the Nix dependency graph into a Buildkite pipeline on demand. https://github.com/serokell/common-infra
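The graph-to-pipeline transformation might look roughly like this: one Buildkite step per derivation, so independent store paths can build on separate agents. The `.drv` names below are fake stand-ins for what `nix-store --query --requisites` would emit for the real derivation:

```shell
# Hypothetical dependency list; in the real system this would come from
# querying the instantiated top-level derivation.
deps="mina-libs.drv
mina-daemon.drv"

# Emit one pipeline step per derivation.
pipeline="steps:"
for d in $deps; do
  pipeline="$pipeline
  - label: \"build $d\"
    command: \"nix-store --realise /nix/store/$d\"
    agents:
      queue: default"
done
echo "$pipeline"
```

A real implementation would also encode the graph's edges as step dependencies (or `wait` barriers) so a derivation only builds after its inputs, which is the hard part this sketch leaves out.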
@yorickvP It's great to have these notes here to prompt some thought by others. Thank you. I propose that the next action is to gather as a team (those who would work on things related to Nix CI) and construct a set of GitHub issues that are a breakdown of the work into efforts that would sum to give the vision that you collectively develop for this (Epic) work.
The vision shared in this issue, @michal0mina, need not necessarily be shared by you. It may be that this issue is not necessary to complete the "Nix-Enabled CI" project. At your discretion.
Nix-based build, test and deployment system for Mina.