2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[Spike: 2hr] Learn how to upgrade k8s version of clusters running on Azure #4669

Closed sgibson91 closed 1 month ago

sgibson91 commented 2 months ago

Context

Upgrading the k8s version of the control planes and node pools of our clusters is an ongoing maintenance task, but we currently do not have documentation/policies on how to manage this for our Azure clusters. This spike will tell us what options are available to us and inform how we want to canonically approach Azure k8s version upgrades going forward.

https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster (and links therein) should provide some helpful info.

Task list

No response

Definition of Done

Pre-defined Definition of Done

yuvipanda commented 2 months ago

This is reserved for @sgibson91 to work on.

sgibson91 commented 1 month ago

Used 25 mins of the spike to add terraform functionality similar to what we have for GCP: an output listing the latest supported k8s versions
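
A sketch of what such an output could look like, using the azurerm provider's `azurerm_kubernetes_service_versions` data source (the variable and output names here are illustrative, not necessarily what landed in our config):

```hcl
# List the AKS-supported k8s versions for the cluster's region,
# so we can pick an upgrade target without leaving terraform.
data "azurerm_kubernetes_service_versions" "current" {
  location        = var.location  # hypothetical variable name
  include_preview = false         # only GA versions
}

output "latest_supported_k8s_versions" {
  description = "k8s versions currently supported by AKS in this region"
  value       = data.azurerm_kubernetes_service_versions.current.versions
}
```

Running `terraform output latest_supported_k8s_versions` after an apply would then print the candidate versions, mirroring the GCP workflow.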

sgibson91 commented 1 month ago

There are automatic upgrade options for k8s on Azure, but I'm not sure what our policy on using them is. E.g., do we prefer to be in control of when we upgrade to communicate potential outages? My instinct is saying yes.

ETA: I see we have explicitly disabled the release channels in GCP too so that verifies my feeling that we want to do this manually.
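
For reference, keeping that policy explicit in the Azure config could look roughly like this (argument name as in azurerm provider 3.x; the rest of the resource is elided):

```hcl
resource "azurerm_kubernetes_cluster" "jupyterhub" {
  # ... name, location, node pools, etc. elided ...

  # Defaults to null anyway, but spelling it out documents the policy:
  # no automatic upgrade channel, upgrades are triggered manually so we
  # can communicate potential outages to communities in advance.
  automatic_channel_upgrade = null
}
```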

sgibson91 commented 1 month ago

Terraform treats k8s version upgrades as an update-in-place :+1:

Rough plan for Azure k8s upgrades

  1. Use the terraform output to establish which version to upgrade to. Note that AKS does not support skipping minor versions for the control plane, so going straight from 1.28 to 1.30, for example, would require an intermediate upgrade to 1.29. https://learn.microsoft.com/en-us/azure/architecture/operator-guides/aks/aks-upgrade-practices#cluster-upgrades
  2. Upgrade control plane. In tf config, pin defined node pools to current k8s version in node_pool variable. Use kubernetes_version variable to define the new version for the control plane. Run tf plan & tf apply.
  3. Upgrade core pool. Remove k8s version pin in node_pool definition (added in step 2) and re-run tf plan & tf apply.
  4. Upgrade user pool. Same as step 3 for the remaining nodepools.
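
The plan above could be sketched in HCL roughly as follows (resource names, variable names, and version numbers are illustrative, not our actual config):

```hcl
variable "kubernetes_version" {
  description = "k8s version for the control plane (bumped in step 2)"
  type        = string
}

resource "azurerm_kubernetes_cluster" "jupyterhub" {
  # ... name, location, identity, etc. elided ...
  kubernetes_version = var.kubernetes_version

  default_node_pool {
    name    = "core"
    vm_size = "Standard_E4s_v3"
    # Step 2: pin the core pool to the *old* version so that only the
    # control plane upgrades on the first apply.
    # Step 3: remove this pin and re-apply to upgrade the core pool.
    orchestrator_version = "1.28.9"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "user" {
  kubernetes_cluster_id = azurerm_kubernetes_cluster.jupyterhub.id
  name                  = "user"
  vm_size               = "Standard_E8s_v3"
  # Step 4: same treatment as the core pool, once core is done.
  orchestrator_version = "1.28.9"
}
```

Each step is then a small config change followed by `terraform plan` / `terraform apply`, with the pins making the ordering (control plane, then core, then user pools) explicit.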

If the upgrade happens in place, I don't think we'll need to worry about rolling or recreate-style upgrades like on AWS (where all the node pools get destroyed and recreated in order to be upgraded).