aenix-io / cozystack

Free and Open Source PaaS-platform for seamless management of virtual machines, managed Kubernetes, and Databases-as-a-Service
https://cozystack.io
Apache License 2.0

Rolling update for Talos nodes #20

Open kvaps opened 6 months ago

kvaps commented 6 months ago

During cluster setup, the user has to upload secrets.yaml and cluster.conf into Kubernetes, e.g.:

kubectl create secret generic -n cozy-system cozy-talos-bootstrap --from-file secrets.yaml --from-file cluster.conf

This will start a reconciliation controller, which checks all the nodes in the cluster and performs a rolling update on them:

talosctl -e <node_address> -n <node_address> upgrade --preserve=<true|false> -i <image>

During the upgrade, the Talos config on the node should also be updated to contain the new image. In the talos-bootstrap script we usually do that immediately before the upgrade operation.
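To make the per-node step concrete, here is a minimal Python sketch of how a controller could assemble the command above and run it for each node in turn. The function names and the injected `run` callable are hypothetical; only the talosctl flags are taken from the command shown in this issue.

```python
from typing import Callable, Iterable, List


def upgrade_command(node_address: str, image: str, preserve: bool) -> List[str]:
    """Build the talosctl argv for upgrading a single node."""
    return [
        "talosctl",
        "-e", node_address,
        "-n", node_address,
        "upgrade",
        f"--preserve={'true' if preserve else 'false'}",
        "-i", image,
    ]


def rolling_upgrade(
    node_addresses: Iterable[str],
    image: str,
    preserve: bool,
    run: Callable[[List[str]], object],
) -> None:
    """Upgrade nodes strictly one after another.

    `run` is supplied by the caller (hypothetical hook), so the loop can be
    exercised without a real cluster.
    """
    for address in node_addresses:
        run(upgrade_command(address, image, preserve))
```

In a real controller `run` could be something like `lambda argv: subprocess.run(argv, check=True)`, so that a failed upgrade stops the rollout before the next node is touched.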

kvaps commented 6 months ago

I think we need to introduce a new resource to track the process of updates:

apiVersion: cozystack.io/v1alpha1
kind: TalosNode
metadata:
  name: srv1
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: srv1
    uid: d7ba2238-d45a-4edc-a348-aa5157afc730
spec:
  image: ghcr.io/aenix-io/cozystack/installer:v0.0.2
  suspend: false
status:
  image: ghcr.io/aenix-io/cozystack/installer:v0.0.2
  lastAppliedPatchHash: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
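As a rough illustration of how the controller might use this resource, the sketch below decides whether a node still needs work by comparing spec against status. It assumes that `lastAppliedPatchHash` is a SHA-256 hex digest of the patch file (the 64-character value above is consistent with that, but the issue does not say so) and that `suspend: true` means "skip this node". Both helper names are hypothetical.

```python
import hashlib


def patch_hash(patch_bytes: bytes) -> str:
    """Hash the patch file contents (assumption: SHA-256 hex digest)."""
    return hashlib.sha256(patch_bytes).hexdigest()


def needs_update(talos_node: dict, desired_patch_hash: str) -> bool:
    """Return True when the node's status lags behind its spec.

    `talos_node` is the TalosNode object as a plain dict.
    """
    spec = talos_node.get("spec", {})
    status = talos_node.get("status", {})

    # Assumption: a suspended node is never touched by the controller.
    if spec.get("suspend", False):
        return False

    image_outdated = status.get("image") != spec.get("image")
    patch_outdated = status.get("lastAppliedPatchHash") != desired_patch_hash
    return image_outdated or patch_outdated
```

After a successful upgrade the controller would write the new image and hash back into `status`, which makes `needs_update` return False on the next reconcile.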

The talos-updater controller should take the following arguments:

--image=ghcr.io/aenix-io/cozystack/installer:v0.0.2
--preserve=false
--patch-file=common-node-parameters.yaml
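A minimal sketch of parsing these three arguments, written in Python purely for illustration (the issue does not say which language the controller would use). The accepted spellings for the boolean and the choice of which flags are required are assumptions.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Parse --preserve=<true|false>; argparse has no built-in bool type."""
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="talos-updater")
    parser.add_argument("--image", required=True,
                        help="installer image every node should run")
    parser.add_argument("--preserve", type=parse_bool, default=False,
                        help="value passed to talosctl upgrade --preserve")
    parser.add_argument("--patch-file", required=True,
                        help="YAML file applied to each node as a merge patch")
    return parser
```

For example, `build_parser().parse_args(["--image=img", "--preserve=false", "--patch-file=common-node-parameters.yaml"])` yields a namespace with `image`, `preserve` and `patch_file` attributes.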

When a new version of cozystack is installed, it will update talos-updater to run with the new arguments; talos-updater should then update all the nodes in the cluster one by one.

common-node-parameters.yaml contains parameters that must be applied as a merge patch, for example:

machine:
  kubelet:
    nodeIP:
      validSubnets:
      - 192.168.100.0/24
  kernel:
    modules:
    - name: openvswitch
    - name: drbd
      parameters:
        - usermode_helper=disabled
    - name: zfs
  install:
    image: ghcr.io/aenix-io/cozystack/talos:v1.6.4
  files:
  - content: |
      [plugins]
        [plugins."io.containerd.grpc.v1.cri"]
          device_ownership_from_security_context = true
    path: /etc/cri/conf.d/20-customization.part
    op: create

cluster:
  network:
    cni:
      name: none
    podSubnets:
    - 10.244.0.0/16
    serviceSubnets:
    - 10.96.0.0/16
  allowSchedulingOnControlPlanes: true
  controllerManager:
    extraArgs:
      bind-address: 0.0.0.0
  scheduler:
    extraArgs:
      bind-address: 0.0.0.0
  proxy:
    disabled: true
  discovery:
    enabled: false
  etcd:
    advertisedSubnets:
    - 192.168.100.0/24
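To show what "applied as a merge patch" could mean in practice, here is a small sketch that follows JSON Merge Patch (RFC 7386) semantics on already-parsed config dicts: nested mappings are merged key by key, while scalars and lists are replaced wholesale. This is an assumption about the intended behaviour; Talos's own config patching may treat lists differently, and the function name is hypothetical.

```python
import copy


def merge_patch(target, patch):
    """Return a new value with `patch` merged into `target` (RFC 7386 style)."""
    if not isinstance(patch, dict):
        # Scalars and lists replace the target value entirely.
        return copy.deepcopy(patch)

    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # A null in the patch removes the key from the target.
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result
```

With this approach, a patch that only sets `machine.install.image` leaves the other keys under `machine.install` on the node untouched.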

Only one node per cluster may be upgraded at a time.
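One possible way to enforce this constraint is to have the controller refuse to pick a new node while any upgrade is still in flight. The sketch below assumes each node is summarised as a dict with hypothetical `name`, `upgrading` and `outdated` fields; none of these fields are defined in this issue.

```python
from typing import List, Optional


def pick_next_node(nodes: List[dict]) -> Optional[str]:
    """Return the name of the next node to upgrade, or None.

    None means either an upgrade is already running or nothing is outdated.
    """
    # Never start a second upgrade while one is in progress.
    if any(node["upgrading"] for node in nodes):
        return None

    # Deterministic order (by name) so repeated reconciles agree.
    for node in sorted(nodes, key=lambda n: n["name"]):
        if node["outdated"]:
            return node["name"]
    return None
```

Calling this once per reconcile and acting only on its result keeps the rollout to a single node at a time.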