berops / claudie

Cloud-agnostic managed Kubernetes
https://docs.claudie.io/
Apache License 2.0

Bug: `kube-eleven` fails to remove control-plane node when using proxy #1519

Closed JKBGIT1 closed 3 weeks ago

JKBGIT1 commented 1 month ago

Current Behaviour

ansibler restarts kube-apiserver and the other static pods when it updates the no-proxy environment variables (see https://github.com/berops/claudie/blob/master/services/ansibler/server/ansible-playbooks/update-noproxy-envs.yml#L10). In the next phase, kube-eleven can't elect a leader for a quorum (see https://github.com/berops/claudie/issues/1515) or remove a control-plane node, because the kube-apiserver pod is down on every control-plane node.

The following logs show the error raised when kube-eleven can't remove a specific control-plane node from the Kubernetes cluster.

ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:02 CEST" level=info msg="Determine hostname..."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:03 CEST" level=info msg="Determine operating system..."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:04 CEST" level=info msg="Running host probes..."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:06 CEST" level=info msg="Electing cluster leader..."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:06 CEST" level=info msg="Elected leader \"gcp-kube-nodes-c09hagz-01\"..."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:08 CEST" level=info msg="Building Kubernetes clientset..."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:08 CEST" level=info msg="Running cluster probes..."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:09 CEST" level=error msg="Host \"oci-kube-nodes-s6toh60-01\" is broken and needs to be manually removed\n"
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:09 CEST" level=warning msg="Hosts must be removed in a correct order to preserve the Etcd quorum."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:09 CEST" level=warning msg="Loss of the Etcd quorum can cause loss of all data!!!"
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:09 CEST" level=warning msg="After removing the recommended hosts, run 'kubeone apply' before removing any other host."
ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:09 CEST" level=warning msg="No other broken node can be removed without losing quorum."
ts4-c-1-cluster-test-set-no4-cejmofr     Error: configuration broken hosts check
ts4-c-1-cluster-test-set-no4-cejmofr     broken host(s) found, remove it manually

In some cases, kube-eleven eventually succeeds in removing the control-plane node after a couple of retries. In many others, it runs out of retries and fails.
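When triaging a run like the one above, it can help to pull the affected host names out of the kube-eleven output. The snippet below is a minimal sketch, assuming the error line keeps the exact format shown in the log excerpt; `BROKEN_HOST` and `broken_hosts` are hypothetical names, not part of Claudie.

```python
import re

# Matches the error line from the log excerpt, where the host name is
# wrapped in backslash-escaped quotes inside msg="...".
# Assumption: the message format stays exactly as shown in the excerpt.
BROKEN_HOST = re.compile(r'level=error msg="Host \\"([^"\\]+)\\" is broken')


def broken_hosts(log_text: str) -> list[str]:
    """Return the host names reported as broken, in order of appearance."""
    return BROKEN_HOST.findall(log_text)


# One line copied from the log excerpt in this issue.
sample = r'ts4-c-1-cluster-test-set-no4-cejmofr     time="09:01:09 CEST" level=error msg="Host \"oci-kube-nodes-s6toh60-01\" is broken and needs to be manually removed\n"'

print(broken_hosts(sample))
```

Lines at other log levels are ignored, so the function returns an empty list for a run in which no host was reported as broken.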

Expected Behaviour

kube-eleven shouldn't fail to elect a leader or to remove a control-plane node when a proxy is in use.
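One way to approach this behaviour, sketched here under assumptions and not necessarily how Claudie handles it, is to wait until kube-apiserver answers again on the control-plane nodes before the kube-eleven phase starts. `wait_until_ready` and its parameters are hypothetical; the `probe` callable stands in for whatever readiness check is available.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `attempts` are used up."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or timed out, e.g. while a static pod
            # is still restarting. Treat it as "not ready yet".
            pass
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A probe could, for example, request the kube-apiserver `/readyz` endpoint on each control-plane node and return `True` only when all of them respond successfully. Which endpoint, port, and credentials to use is an assumption that depends on the cluster setup.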

Steps To Reproduce

  1. Create a cluster using a proxy
  2. Replace or remove a control-plane node pool
  3. Check the kube-eleven logs