DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster
Apache License 2.0

Implement high availability control plane #1

Open gothub opened 3 years ago

gothub commented 3 years ago

Maintenance tasks such as k8s upgrades, OS upgrades, and reconfigurations (disk, etc.) can require k8s nodes to be taken offline for reconfiguration and rebooting.

Minimize k8s service disruptions when these maintenance tasks are performed by implementing a high availability control plane.

This issue supersedes https://github.com/NCEAS/metadig-engine/issues/287

gothub commented 3 years ago

Some approaches to implementing a high availability control plane are detailed here: https://github.com/kubernetes/kubeadm/blob/master/docs/ha-considerations.md

This document discusses both external load balancers (e.g. HAProxy on external nodes) and software load balancing. For the latter configuration, keepalived and haproxy run on the control plane nodes, so an external load balancer is not required to switch control to a new active control plane node if the current primary becomes unavailable.

With either configuration (external or internal load balancing), extra nodes would need to be added to the cluster to act as standby control nodes.
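
To make the failover idea concrete, here is a rough Python sketch of how a keepalived-style setup decides which control node answers on the shared virtual IP: the healthy node with the highest priority holds it. This is only a model of the selection rule, not keepalived itself. The `ControlNode` type, the `vip_holder` function, the node names, and the priority values are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ControlNode:
    name: str
    priority: int  # assumed VRRP-style priority; higher wins
    healthy: bool  # outcome of a health check (e.g. against the API server)


def vip_holder(nodes: List[ControlNode]) -> Optional[ControlNode]:
    """Return the healthy node with the highest priority, or None if all are down."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda n: n.priority)


# Illustrative three-node control plane; names and priorities are placeholders.
nodes = [
    ControlNode("k8s-ctrl-1", priority=101, healthy=True),
    ControlNode("k8s-ctrl-2", priority=100, healthy=True),
    ControlNode("k8s-ctrl-3", priority=99, healthy=True),
]
print(vip_holder(nodes).name)

# Simulate the primary going offline for maintenance.
degraded = [ControlNode("k8s-ctrl-1", 101, healthy=False)] + nodes[1:]
print(vip_holder(degraded).name)
```

In this model, taking the primary down for an upgrade simply moves the virtual IP to the next-highest-priority standby, which is why the standby control nodes have to exist before the maintenance starts.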

gothub commented 3 years ago

BTW - the link shown above (https://github.com/kubernetes/kubeadm/blob/master/docs/ha-considerations.md) uses kubeadm to implement a 3 control-node HA k8s cluster, with a 'stacked' etcd cluster, or optionally with the etcd nodes external to the cluster.
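
As a side note on why three control nodes is the usual minimum for this: etcd needs a majority of its members (a quorum) to be available to accept writes. The arithmetic can be sketched as below. The function names are my own, and this assumes a stacked layout with one etcd member per control node.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be offline while a quorum is still available."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={failures_tolerated(n)}")
```

With three members the quorum is two, so one control node at a time can be rebooted or upgraded. Two members tolerate no failures at all, so adding only one extra control node would not help.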

nickatnceas commented 2 years ago

Two VMs, k8s-ctrl-2 and k8s-ctrl-3, have been provisioned for K8s over in https://github.nceas.ucsb.edu/NCEAS/Computing/issues/98

The physical-to-virtual layout of the control plane VMs is:

host-ucsb-6: k8s-ctrl-1
host-ucsb-7: k8s-ctrl-2
host-ucsb-8: k8s-ctrl-3

nickatnceas commented 1 year ago

In a Slack discussion we decided to set up backups for K8s and K8s-dev before converting our install to HA.

We may need to upgrade K8s before the HA changes, which in turn may require an OS upgrade on the existing controllers.