Closed by mkaito 2 months ago
The YAML configuration file lists all accounts that Buildkite has access to as ENV values. I have added a comment next to each one noting how it is used.
https://github.com/MinaProtocol/mina-deployments/blob/main/src/buildkite/buildkite-central1.yaml
There were a few conversations about how best to separate or protect the access that is granted. The currently favored idea was to use a system such as HashiCorp Vault to sit in between and issue temporary tokens on demand (zero-knowledge access), but adding this needs additional resources and planning.
Current status: @mkaito: Started working on porting the vault-secrets stuff to GCP
While I do that, let me explain what I'm doing.
I'm working on updating a HashiCorp Vault deployment that I wrote for my previous employer for use at Mina on GCP. `vault-secrets` is not involved; while my deployment is based on the same architecture, `vault-secrets` is a NixOS module for managing server-level secrets.
@yorickvP wrote a buildkite hook that pulled secrets from a known path in Vault, constructed from a combination of BK variables such as repo and pipeline name, and injected them into the runner environment. As hooks run at the pipeline level, this made secrets effectively pipeline-isolated.
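A minimal sketch of what such an environment hook might look like, assuming a hypothetical path layout of `secret/buildkite/<org>/<pipeline>` (the real hook's path convention and secret names may differ):

```shell
#!/usr/bin/env sh
# Buildkite environment hook sketch: derive a pipeline-specific Vault path
# from Buildkite's built-in variables, then export each secret found there.
# The path layout below is an assumption, not the actual hook's convention.
: "${BUILDKITE_ORGANIZATION_SLUG:=example-org}"
: "${BUILDKITE_PIPELINE_SLUG:=example-pipeline}"

VAULT_PATH="secret/buildkite/${BUILDKITE_ORGANIZATION_SLUG}/${BUILDKITE_PIPELINE_SLUG}"
echo "fetching secrets from ${VAULT_PATH}"

# Only attempt the fetch when the vault CLI is present and a server is configured.
if command -v vault >/dev/null 2>&1 && [ -n "${VAULT_ADDR:-}" ]; then
  # For KV v2 mounts, `kv get -format=json` nests the payload under .data.data
  vault kv get -format=json "$VAULT_PATH" \
    | jq -r '.data.data | to_entries[] | "export \(.key)=\(.value)"' > /tmp/bk-secrets.env
  . /tmp/bk-secrets.env
fi
```

Because the path is derived from the pipeline's own identifiers, a job can never address another pipeline's secrets.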
In practice, this means that an operator can place a secret at a specific path in Vault, and it will then automatically be available in matching pipelines. This can be done via the `vault` CLI or the web interface. I recommend the former, as I have little experience with the web interface.
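For example, writing a secret for a pipeline could look like this (the path and key name are illustrative and match no particular real pipeline; it requires a configured `vault` CLI pointed at the server):

```shell
# Place a secret under the (hypothetical) path the hook reads for this pipeline:
vault kv put secret/buildkite/example-org/example-pipeline \
    DOCKER_PASSWORD='s3cr3t'

# Confirm what the matching pipeline will see:
vault kv get secret/buildkite/example-org/example-pipeline
```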
Initial Vault and Consul setup can be quite involved, but the system is pleasant to use at runtime.
Vault is, in essence, a REST API for secrets. It doesn't store secrets itself; it provides the REST API, along with various authentication mechanisms and pluggable backends (you can plug things like SSH or PostgreSQL into Vault as secrets backends). Storage is handled by Consul, which also does much more than just storage.
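As a rough sketch (addresses and file paths are placeholders), a Vault server configured to use Consul for storage looks like this:

```hcl
# vault.hcl — minimal sketch, not a production configuration
storage "consul" {
  address = "127.0.0.1:8500"  # local Consul agent; keep this traffic private
  path    = "vault/"          # key prefix inside Consul's KV store
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/cert.pem"
  tls_key_file  = "/etc/vault/tls/key.pem"
}
```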
While both Vault and Consul are very powerful tools, the usage that concerns us here is actually very simple, and requires little involvement.
As Vault does not store any data, Vault nodes can be considered ephemeral. Consul handles all the data, which is encrypted at rest. Traffic between Vault and Consul should only happen over a private network or an encrypted channel.
The data fed to Consul is encrypted by Vault itself; Consul cannot read this data, it only stores it.
Vault uses a mechanism called Shamir's Secret Sharing, which splits the master encryption key into a number of key shares. The number of shares required to carry out an operation can be smaller than the total. For example, you may have 5 operators, each holding their own share, but decide that only 3 of them need to be present for a privileged action such as unsealing.
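For the 5-operator, 3-of-5 example above, the split is chosen when Vault is first initialized; the commands would look roughly like this:

```shell
# Initialize Vault with 5 key shares, any 3 of which can unseal it.
vault operator init -key-shares=5 -key-threshold=3

# Later, three different operators each supply one share:
vault operator unseal   # prompts for a key share; repeat until unsealed
```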
Vault also supports "auto unseal", where a trusted cloud service (such as a cloud KMS) takes the place of the operators.
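On GCP that trusted service would likely be Cloud KMS, configured with a seal stanza along these lines (project, key ring, and key names are placeholders):

```hcl
# Auto-unseal via GCP Cloud KMS — all values below are placeholders.
seal "gcpckms" {
  project    = "example-project"
  region     = "us-central1"
  key_ring   = "vault-unseal"
  crypto_key = "vault-unseal-key"
}
```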
Consul is a networking and configuration management Swiss Army knife; it does a lot of things. We only want it to provide a fault-tolerant, highly available (HA) data backend for Vault.
The smallest possible Consul deployment involves 3 nodes. The nodes are spun up using a pre-shared secret that must be available on each server. This is the only secret in the system that cannot be stored in Vault itself, since it must be available before Vault can start.
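That pre-shared secret is Consul's gossip encryption key. A sketch of bootstrapping the 3-node cluster (addresses and paths are placeholders):

```shell
# Generate the gossip key once and distribute it to all three nodes:
GOSSIP_KEY="$(consul keygen)"

# Each node starts with the same key; the cluster forms once 3 servers join:
consul agent -server -bootstrap-expect=3 -encrypt="$GOSSIP_KEY" \
    -data-dir=/var/lib/consul -retry-join=10.0.0.2 -retry-join=10.0.0.3
```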
Backups are handled by taking hot snapshots from any Consul node; the data on all nodes is identical and consistent at any given time. Snapshot data is still encrypted and does not need additional protection.
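Taking and restoring such a snapshot is a one-liner against any node (filenames here are illustrative):

```shell
# Hot snapshot from any Consul server; the payload is already encrypted by Vault:
consul snapshot save "backup-$(date +%F).snap"

# Restore into a cluster if needed:
consul snapshot restore backup-2024-01-01.snap
```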
While Vault itself is quite stable, we're going to run 2 nodes in HA mode with automatic failover. The nodes talk to each other and decide which one is active; the standby rejects all connections and redirects to its peer, and external health checks will route traffic to whichever node is active.
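Concretely, each Vault node advertises its own addresses, and a standby uses the active node's advertised address for redirects. A sketch of the per-node settings (hostnames are placeholders):

```hcl
# Per-node HA settings in vault.hcl — each node advertises its own addresses.
api_addr     = "https://vault-1.internal:8200"  # where clients get redirected
cluster_addr = "https://vault-1.internal:8201"  # node-to-node HA traffic
```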
My experience with this configuration is deploying it on AWS. I have little experience with GCP, but I'm hoping there are sufficient parallels between the two services to allow a similar, hopefully identical configuration. This is what I'm currently working on: porting the deployment to GCP.
Right now, Buildkite is configured with admin tokens and some service accounts because it has to publish packages and run automated tests. Is there a more secure way to run tests and builds without granting such broad access to the infra systems?