Closed: @schu closed this issue 6 years ago.
@schu on which infrastructure are you setting up your cluster? I haven't seen this behaviour on OpenStack.
@afritzler on AWS.
I have also applied this patch to kubify to get more logs and keep finished containers for debugging:
```diff
diff --git a/modules/nodes/templates/cloud-init b/modules/nodes/templates/cloud-init
index 90f4414..d482e8d 100644
--- a/modules/nodes/templates/cloud-init
+++ b/modules/nodes/templates/cloud-init
@@ -57,6 +57,8 @@ ${units} - name: bootstrap.service
       --mount=volume=etc-resolv-conf,target=/etc/resolv.conf \
       --insecure-options=image"
     ExecStart=/usr/lib/coreos/kubelet-wrapper \
+      --minimum-container-ttl-duration=60m \
+      -v 3 \
       --kubeconfig=/etc/kubernetes/kubelet.conf \
       --require-kubeconfig=true \
       --pod-manifest-path=/etc/kubernetes/manifests \
```
Thanks @schu
So in the end, the problem is that the kubernetes service in the default namespace has the following session affinity timeout:
```yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-02-26T15:35:29Z
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
  resourceVersion: "62"
  selfLink: /api/v1/namespaces/default/services/kubernetes
  uid: acb349eb-1b0a-11e8-8d96-fa163e521bd0
spec:
  clusterIP: 10.241.0.1
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: ClusterIP
status:
  loadBalancer: {}
```
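Until the fix is rolled out, session affinity can be switched off as a workaround. A sketch, assuming the field is mutable on your cluster version; note that this modifies the live `default/kubernetes` service:

```yaml
# Strategic merge patch disabling ClientIP affinity on default/kubernetes.
# Apply e.g. with:
#   kubectl patch svc kubernetes -n default -p '{"spec":{"sessionAffinity":"None"}}'
spec:
  sessionAffinity: None
```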
Here is the corresponding default in the code: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L2970
This is a known issue and has been fixed with https://github.com/kubernetes/kubernetes/pull/56690
I created 2 backport PRs for the 1.9 and 1.10 release branches in Kubernetes: https://github.com/kubernetes/kubernetes/pull/65178 https://github.com/kubernetes/kubernetes/pull/65177
@afritzler thanks :+1:
Well, thank you guys for the findings!
Ok, the cherry-pick for 1.10 has been merged now. The ones for 1.9 and 1.8 are coming up as well. I will close this issue.
landscape-setup-template users frequently hit an error during cluster setup or end up with an unhealthy cluster where only 2 out of 3 kube-apiserver pods are running. Currently, we know of the following symptoms:

1. Cluster setup fails early due to etcd operator errors (https://github.com/gardener/kubify/issues/48):

2. Cluster is unhealthy due to kube-controller-manager continuously throwing errors (the pod stays running, though):
The fact that the error (`dial tcp 10.241.0.1:443: getsockopt: connection refused`) is encountered for all requests looks like a routing error at first: 2 out of 3 apiserver instances are running and reachable after all, and we expect requests to the service IP to be distributed among the set of available pods (i.e., shouldn't 2 out of 3 requests succeed?).

This is most likely due to the (default) `sessionAffinity` setting for the `default/kubernetes` service: once a request from a source IP has been routed to a `KUBE-SEP` chain, it will be routed there for the next 3 hours (10800 seconds). E.g. if the leading controller-manager pod happens to be routed to the faulty node (without kube-apiserver running), all its requests will end up there until the timeout is reached. The iptables rules for that look like:

By removing the `sessionAffinity` setting from the `default/kubernetes` service (e.g. with `kubectl edit svc kubernetes`), the problem can be fixed for symptom 2 (as described above): controller-manager will eventually hit a healthy apiserver instance and be able to go on with its tasks. The missing kube-apiserver pod will be rescheduled shortly after.

Noteworthy is that on the faulty master node where kube-apiserver is not running, the checkpoint is also missing (otherwise the pod should be running again shortly after it stopped): `find /etc/kubernetes/ -iname '*api*'` returns nothing. The checkpointer logs show the following:

Current status:
I don't know yet why this happens, but the root cause seems to be a problem during kube-apiserver bootstrapping. I'll add more info as I find it.
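For anyone inspecting this on a node: with kube-proxy in iptables mode, `ClientIP` affinity is implemented via the `recent` match. The rules look roughly like the sketch below (illustrative only, with made-up chain hashes and a placeholder endpoint IP, not output from the affected cluster):

```
-A KUBE-SERVICES -d 10.241.0.1/32 -p tcp -m tcp --dport 443 -j KUBE-SVC-XXXXXXXXXXXXXXXX
-A KUBE-SVC-XXXXXXXXXXXXXXXX -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-YYYYYYYYYYYYYYYY -j KUBE-SEP-YYYYYYYYYYYYYYYY
-A KUBE-SEP-YYYYYYYYYYYYYYYY -p tcp -m recent --set --name KUBE-SEP-YYYYYYYYYYYYYYYY -j DNAT --to-destination <apiserver-node-ip>:443
```

Once a source IP has been DNATed through a `KUBE-SEP` chain (the `--set` rule), the `--rcheck --seconds 10800` rule in the service chain keeps sending that source to the same endpoint until the entry expires, which matches the 3-hour stickiness described above.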
Any ideas? :)