gardener-attic / kubify

Terraform Template to Setup a Kubernetes Cluster on OpenStack/AWS/Azure

kube-apiserver bootstrap is unstable #49

Closed · schu closed this 6 years ago

schu commented 6 years ago

landscape-setup-template users frequently hit an error during cluster setup or end up with an unhealthy cluster where only 2 out of 3 kube-apiserver pods are running. Currently, we know of the following symptoms:

  1. Cluster setup fails early due to etcd operator errors (https://github.com/gardener/kubify/issues/48):

    E0612 12:16:54.377420       1 leaderelection.go:224] error retrieving resource lock kube-system/etcd-operator: Get https://10.241.0.1:443/api/v1/namespaces/kube-system/endpoints/etcd-operator: dial tcp 10.241.0.1:443: getsockopt: connection refused
  2. Cluster is unhealthy due to kube-controller-manager continuously throwing errors (the pod stays running, though; a quick check is sketched after this list):

    E0608 08:53:38.676846       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: Get https://10.241.0.1:443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager: dial tcp 10.241.0.1:443: getsockopt: connection refused
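
A quick way to confirm symptom 2 is to look at the pod status and the controller-manager logs directly; the following is only a sketch (the pod name is a placeholder and the exact names depend on the deployment):

    # list the control plane pods and check how many kube-apiserver instances are actually running
    kubectl -n kube-system get pods -o wide | grep -E 'kube-apiserver|kube-controller-manager'
    # inspect the controller-manager logs for the "connection refused" errors quoted above
    kubectl -n kube-system logs <kube-controller-manager-pod> | grep 'connection refused'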

The fact that the error (dial tcp 10.241.0.1:443: getsockopt: connection refused) is encountered for all requests looks like a routing error at first: after all, 2 out of 3 apiserver instances are running and reachable, and we expect requests to the service IP to be distributed among the set of available pods (i.e. shouldn't 2 out of 3 requests succeed?).

This is most likely due to the (default) sessionAffinity setting for the default/kubernetes service:

  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
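
To quickly check the current setting on a cluster, something along these lines should work (assuming kubectl access):

    # print only the affinity-related fields of the default/kubernetes service
    kubectl get svc kubernetes -n default -o yaml | grep -A 3 'sessionAffinity'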

Once a request from a source IP has been routed to a KUBE-SEP, it will be routed there for the next 3 hours (10800 seconds). E.g. if the leading controller-manager pod happens to be routed to the faulty node (without a running kube-apiserver), all of its requests will end up there until the timeout is reached. The iptables rules for that look like:

-A KUBE-SEP-2KS7SNAGO5W6YUEJ -s 10.251.142.41/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-2KS7SNAGO5W6YUEJ -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-2KS7SNAGO5W6YUEJ --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.251.142.41:443
-A KUBE-SEP-Y44MF4JTQLZ5QDG7 -s 10.251.185.148/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-Y44MF4JTQLZ5QDG7 -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-Y44MF4JTQLZ5QDG7 --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.251.185.148:443
-A KUBE-SEP-YPQKGMLT4E2MSUWB -s 10.251.188.234/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-YPQKGMLT4E2MSUWB -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-YPQKGMLT4E2MSUWB --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.251.188.234:443
-A KUBE-SERVICES ! -s 10.241.0.0/17 -d 10.241.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.241.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-2KS7SNAGO5W6YUEJ --mask 255.255.255.255 --rsource -j KUBE-SEP-2KS7SNAGO5W6YUEJ
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-Y44MF4JTQLZ5QDG7 --mask 255.255.255.255 --rsource -j KUBE-SEP-Y44MF4JTQLZ5QDG7
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-YPQKGMLT4E2MSUWB --mask 255.255.255.255 --rsource -j KUBE-SEP-YPQKGMLT4E2MSUWB
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-2KS7SNAGO5W6YUEJ
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-Y44MF4JTQLZ5QDG7
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-YPQKGMLT4E2MSUWB
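
A rule dump like the one above can be reproduced on a master node with something like this (root privileges assumed):

    # show the NAT rules kube-proxy programs for the default/kubernetes service
    sudo iptables-save -t nat | grep 'default/kubernetes:https'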

By removing the sessionAffinity setting from the default/kubernetes service (e.g. with kubectl edit svc kubernetes), the problem can be fixed for symptom 2 (as described above): controller-manager will eventually hit a healthy apiserver instance and be able to go on with its tasks. The missing kube-apiserver pod will be rescheduled shortly after.
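
Instead of an interactive kubectl edit, a one-off patch along these lines should have the same effect (a sketch, assuming kubectl access to the cluster):

    # switch the default/kubernetes service back to plain round-robin endpoint selection
    kubectl patch svc kubernetes -n default -p '{"spec":{"sessionAffinity":"None"}}'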

It is noteworthy that on the faulty master node where kube-apiserver is not running, the checkpoint is also missing (otherwise the pod should be running again shortly after it stopped): find /etc/kubernetes/ -iname '*api*' returns nothing. The checkpointer logs show the following:

[...]
I0614 07:59:20.397358       1 main.go:225] API GC: skipping inactive checkpoint kube-system/kube-apiserver-tdmfk
I0614 07:59:23.410633       1 main.go:394] Checkpoint manifest for "kube-system/kube-apiserver-tdmfk" already exists. Skipping                                                                                                         
I0614 07:59:23.421755       1 main.go:225] API GC: skipping inactive checkpoint kube-system/kube-apiserver-tdmfk
I0614 07:59:26.433228       1 main.go:394] Checkpoint manifest for "kube-system/kube-apiserver-tdmfk" already exists. Skipping                                                                                                         
I0614 07:59:31.270644       1 main.go:225] API GC: skipping inactive checkpoint kube-system/kube-apiserver-tdmfk
I0614 07:59:31.270759       1 main.go:283] Should start checkpoint kube-system/kube-apiserver-tdmfk
I0614 07:59:31.270949       1 main.go:397] Writing manifest for "kube-system/kube-apiserver-tdmfk" to "/etc/kubernetes/manifests/kube-system-kube-apiserver-tdmfk.json"                                                                
I0614 07:59:34.364611       1 main.go:234] API GC: should remove inactive checkpoint kube-system/kube-apiserver-tdmfk                                                                                                                  
I0614 07:59:34.364673       1 main.go:250] API GC: should remove active checkpoint kube-system/kube-apiserver-tdmfk
I0614 07:59:34.364778       1 main.go:735] Removing checkpoint of: kube-system/kube-apiserver-tdmfk
Current status:

I don't know yet why this happens, but the root cause seems to be a problem during kube-apiserver bootstrapping. I'll add more info as I find it.

Any ideas? :)

afritzler commented 6 years ago

@schu on which infrastructure are you setting up your cluster? I haven't seen this behaviour on OpenStack.

schu commented 6 years ago

@afritzler on AWS.

schu commented 6 years ago

I have also applied this patch to kubify to get more logs and keep finished containers for debugging:

diff --git a/modules/nodes/templates/cloud-init b/modules/nodes/templates/cloud-init
index 90f4414..d482e8d 100644
--- a/modules/nodes/templates/cloud-init
+++ b/modules/nodes/templates/cloud-init
@@ -57,6 +57,8 @@ ${units}  - name: bootstrap.service
         --mount=volume=etc-resolv-conf,target=/etc/resolv.conf \
         --insecure-options=image"
       ExecStart=/usr/lib/coreos/kubelet-wrapper \
+        --minimum-container-ttl-duration=60m \
+        -v 3 \
         --kubeconfig=/etc/kubernetes/kubelet.conf \
         --require-kubeconfig=true \
         --pod-manifest-path=/etc/kubernetes/manifests \
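After rolling the change out, the extra flags can be verified on a node by looking at the rendered unit and its logs, e.g. (assuming the unit is still named bootstrap.service as in the diff context above and that kubelet output ends up in its journal):

    # print the rendered unit to confirm the extra kubelet flags made it into cloud-init
    systemctl cat bootstrap.service
    # follow the more verbose kubelet output (-v 3)
    journalctl -u bootstrap.service -f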
afritzler commented 6 years ago

Thanks @schu

So in the end the problem is that the kubernetes service in the default namespace has the following session affinity timeout:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-02-26T15:35:29Z
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
  resourceVersion: "62"
  selfLink: /api/v1/namespaces/default/services/kubernetes
  uid: acb349eb-1b0a-11e8-8d96-fa163e521bd0
spec:
  clusterIP: 10.241.0.1
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: ClusterIP
status:
  loadBalancer: {}
afritzler commented 6 years ago

Here is the corresponding default in the code: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L2970

afritzler commented 6 years ago

This is a known issue and has been fixed with https://github.com/kubernetes/kubernetes/pull/56690

I created 2 backport PRs for the 1.9 and 1.10 release branches in kubernetes: https://github.com/kubernetes/kubernetes/pull/65178 https://github.com/kubernetes/kubernetes/pull/65177

schu commented 6 years ago

@afritzler thanks :+1:

afritzler commented 6 years ago

Well, thank you guys for the findings!

afritzler commented 6 years ago

Ok, the cherry pick for 1.10 has been merged now. The cherry picks for 1.9 and 1.8 are coming up as well. I will close this issue.