berops / claudie

Cloud-agnostic managed Kubernetes
https://docs.claudie.io/
Apache License 2.0

Feature: Terraform Templates Override #1441

Closed Despire closed 1 day ago

Despire commented 2 months ago

Problem:

How Terraform (TF) works: if you make a breaking change in the TF files, TF replaces the affected resources. In our case, if the nodes have to be replaced (e.g. due to an Ubuntu image update), Terraform destroys them and then re-creates them. However, the re-created nodes are not part of the K8s cluster. The old nodes were not gracefully removed from the cluster either, so the control plane sees them as NotReady and users can't recover from this manually. Still, we need to be able to change the Terraform code for nodepools, for example to upgrade the OS version in the TF files or to add new features that come with breaking changes in the TF HCL code.

Proposal:

Ideally, we should find a way to deliver breaking changes in TF via rolling updates. That is, we create a new nodepool using the new TF code and then remove the old nodepool that was built using the old TF code. This way the cluster can sustain such updates without downtime.

One way to achieve this would be the following. We separate the TF code into a new repository. By default this would be a repository under the berops GitHub organization, but users can override it. Users can also specify the repository tag and commit hash. Unless the user overrides the tag, it matches the running Claudie release version (e.g. 0.8.1). If a tag is specified, Claudie uses the TF code from the commit that tag points to.

If the tag has not been specified and Claudie defaults to the tag matching the current Claudie release version, then upon an upgrade Claudie automatically does a rolling update by deploying the nodepool with the same configuration and the TF code from the new release (e.g. 0.8.2). However, if the user pinned the tag to 0.8.1, a Claudie upgrade to 0.8.2 makes no change to the nodepool infra, because the user's pin of the TF code has higher priority.
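To make the tag precedence concrete, here is a minimal sketch of the resolution rule described in the proposal. It is not Claudie's implementation: the names `TemplateSource`, `resolve_templates`, `needs_rolling_update` and the placeholder `DEFAULT_REPO` are all hypothetical, and the commit hash option is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder only: the proposal says the default lives under berops on GitHub,
# but does not name the repository.
DEFAULT_REPO = "https://github.com/berops/<templates-repo>"


@dataclass(frozen=True)
class TemplateSource:
    repository: str
    tag: str
    pinned: bool  # True when the user set the tag explicitly


def resolve_templates(
    claudie_version: str,
    user_repo: Optional[str] = None,
    user_tag: Optional[str] = None,
) -> TemplateSource:
    """Pick the TF template source; a user-specified tag beats the release tag."""
    repo = user_repo if user_repo is not None else DEFAULT_REPO
    if user_tag is not None:
        return TemplateSource(repository=repo, tag=user_tag, pinned=True)
    # No pin: follow the running Claudie release version.
    return TemplateSource(repository=repo, tag=claudie_version, pinned=False)


def needs_rolling_update(current: TemplateSource, desired: TemplateSource) -> bool:
    """A nodepool is rolled only when the resolved repository or tag changed."""
    return (current.repository, current.tag) != (desired.repository, desired.tag)


# Unpinned nodepool: upgrading Claudie 0.8.1 -> 0.8.2 changes the resolved tag.
before = resolve_templates("0.8.1")
after = resolve_templates("0.8.2")
print(needs_rolling_update(before, after))

# Pinned nodepool: the user's tag wins, so the upgrade changes nothing.
pinned_before = resolve_templates("0.8.1", user_tag="0.8.1")
pinned_after = resolve_templates("0.8.2", user_tag="0.8.1")
print(needs_rolling_update(pinned_before, pinned_after))
```

Under these assumptions, the unpinned case resolves to a new tag after the upgrade and would trigger a rolling update, while the pinned case resolves to the same tag and leaves the nodepool alone.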

Effects: