DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster
Apache License 2.0

Resolve k8s dev cluster errors (control plane pod restarts) #7

Open · gothub opened this issue 3 years ago

gothub commented 3 years ago

The dev k8s cluster is experiencing k8s system errors: the kube-controller-manager, kube-scheduler, and etcd pods are logging many errors and restarting frequently. Here is the pod and service info:

% kubectl get pods,services --all-namespaces -o wide
NAMESPACE       NAME                                            READY   STATUS    RESTARTS   AGE     IP               NODE                NOMINATED NODE   READINESS GATES
kube-system     pod/calico-kube-controllers-78d6f96c7b-t85g9    1/1     Running   113        28d     192.168.50.147   docker-dev-ucsb-2   <none>           <none>
kube-system     pod/calico-node-9jmnl                           1/1     Running   32         23d     128.111.85.191   docker-dev-ucsb-2   <none>           <none>
kube-system     pod/calico-node-fkvm8                           1/1     Running   0          23d     128.111.85.190   docker-dev-ucsb-1   <none>           <none>
kube-system     pod/coredns-558bd4d5db-7gk2f                    1/1     Running   3          34d     192.168.3.154    docker-dev-ucsb-1   <none>           <none>
kube-system     pod/coredns-558bd4d5db-8cfgx                    1/1     Running   3          34d     192.168.3.146    docker-dev-ucsb-1   <none>           <none>
kube-system     pod/etcd-docker-dev-ucsb-1                      1/1     Running   28         34d     128.111.85.190   docker-dev-ucsb-1   <none>           <none>
kube-system     pod/kube-apiserver-docker-dev-ucsb-1            1/1     Running   27         34d     128.111.85.190   docker-dev-ucsb-1   <none>           <none>
kube-system     pod/kube-controller-manager-docker-dev-ucsb-1   1/1     Running   745        34d     128.111.85.190   docker-dev-ucsb-1   <none>           <none>
kube-system     pod/kube-proxy-pmh8p                            1/1     Running   4          34d     128.111.85.191   docker-dev-ucsb-2   <none>           <none>
kube-system     pod/kube-proxy-qhtq9                            1/1     Running   3          34d     128.111.85.190   docker-dev-ucsb-1   <none>           <none>
kube-system     pod/kube-scheduler-docker-dev-ucsb-1            1/1     Running   745        34d     128.111.85.190   docker-dev-ucsb-1   <none>           <none>
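
For the pods with high restart counts, the termination reason of the previous container and its final log lines can be pulled directly; a quick sketch, using the pod names from the listing above:

# why did the last instance terminate (see "Last State" / "Exit Code")
kubectl -n kube-system describe pod kube-scheduler-docker-dev-ucsb-1
# logs from the instance that crashed, not the current one
kubectl -n kube-system logs --previous kube-scheduler-docker-dev-ucsb-1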

Here is a sample from the etcd log:

2021-06-25 20:24:29.846546 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-06-25 20:24:32.488122 W | etcdserver: read-only range request "key:\"/registry/crd.projectcalico.org/networkpolicies/\" range_end:\"/registry/crd.projectcalico.org/networkpolicies0\" count_only:true " with result "range_response_count:0 size:7" took too long (155.539438ms) to execute
2021-06-25 20:24:32.488293 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:7" took too long (363.521981ms) to execute
2021-06-25 20:24:34.085927 W | etcdserver: read-only range request "key:\"/registry/volumeattachments/\" range_end:\"/registry/volumeattachments0\" count_only:true " with result "range_response_count:0 size:7" took too long (117.505394ms) to execute

... and from kube-scheduler:

E0625 19:52:08.671388       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://128.111.85.190:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0625 20:00:47.925831       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://128.111.85.190:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I'm not seeing error messages from kube-controller-manager.

docker-dev-ucsb-1 has adequate disk space, and the system is running at a load average of "0.82 0.92 1.05".

Note that the production cluster is not experiencing these errors or restarts.
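
Since the failing requests above are lease GETs to the apiserver, one quick check (a sketch, run with an admin kubeconfig) is to time the apiserver's own health endpoints, which include the status of its etcd backend:

# per-check readiness, including the etcd check
time kubectl get --raw='/readyz?verbose'
time kubectl get --raw='/livez?verbose'

If these intermittently take several seconds or time out, the slowness is behind the apiserver (etcd or the node itself) rather than in the scheduler or controller-manager.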

gothub commented 3 years ago

This issue is also mentioned in an NCEAS computing repo issue.

gothub commented 2 years ago

Update - the k8s pods kube-scheduler and kube-controller-manager have crashed and been restarted 1300+ times in the last 60 days. Here is a sample from the kube-controller-manager log:

...
goroutine 150 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0009fcf50, 0xc000000001)
        /usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc0009fcf40)
        /usr/local/go/src/sync/cond.go:56 +0x99
k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue.(*Type).Get(0xc000180b40, 0x0, 0x0, 0x3d96e00)
...

The frequency of the warning messages increases to roughly 10 every few seconds when even a light processing load (e.g. assessment report jobs with 10 workers) is running.

It may be that etcd slowness is causing the kube containers to crash.
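
If etcd slowness is the suspect, its disk-latency metrics are a good first check; a sketch, run on docker-dev-ucsb-1 and assuming the kubeadm-default metrics listener on localhost:2381, plus fio installed for the second test:

# etcd's own view of WAL fsync / backend commit latency
curl -s http://127.0.0.1:2381/metrics | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds_(sum|count)'

# benchmark fdatasync latency on the filesystem that backs /var/lib/etcd
mkdir -p /var/lib/etcd-fio-test
fio --name=etcd-fsync --directory=/var/lib/etcd-fio-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
rm -rf /var/lib/etcd-fio-test

etcd's general guidance is that the 99th percentile fdatasync latency should stay under roughly 10ms; much higher values would line up with the "took too long" warnings above.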

gothub commented 2 years ago

Typical kube-controller-manager log output that appears just before a restart:

goroutine 956 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc00133fad0, 0xc000000000)
    /usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc00133fac0)
    /usr/local/go/src/sync/cond.go:56 +0x99
k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue.(*Type).Get(0xc00091f2c0, 0x0, 0x0, 0x3d96e00)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/queue.go:151 +0x89
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates.(*DynamicCertKeyPairContent).processNextWorkItem(0xc000171260, 0x203000)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_serving_content.go:153 +0x66
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates.(*DynamicCertKeyPairContent).runWorker(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_serving_content.go:148
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00088b390)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00088b390, 0x500ac60, 0xc001752a20, 0x4b24b01, 0xc0004cb080)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00088b390, 0x3b9aca00, 0x0, 0x1, 0xc0004cb080)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc00088b390, 0x3b9aca00, 0xc0004cb080)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates.(*DynamicCertKeyPairContent).Run
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_serving_content.go:134 +0x28b

gothub commented 2 years ago

The kube-scheduler and kube-controller-manager restarts appear to be due to request timeouts: when a leader-election request times out, these services exit so that another replica can become the new leader. Even though we only have a single-node control plane, they restart instead of retrying their request, as described in this issue: Kube-scheduler dies when connection to API server is lost

This explains the restarts, but it does not explain why the k8s api-server is not responding in time. It's possible that having a multi-node setup could compensate for the underlying problem, but the underlying problem has yet to be resolved.
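
One mitigation short of adding control-plane nodes is to give the leader-election loop more slack. Both kube-scheduler and kube-controller-manager accept leader-election tuning flags in their static pod manifests (typically /etc/kubernetes/manifests/kube-scheduler.yaml and kube-controller-manager.yaml on the control-plane node). A sketch with illustrative values only; the upstream defaults are a 15s lease, 10s renew deadline, and 2s retry period:

    - --leader-elect=true
    - --leader-elect-lease-duration=60s
    - --leader-elect-renew-deadline=40s
    - --leader-elect-retry-period=5s

This wouldn't fix the slow apiserver/etcd responses, but it gives the components more time to renew their lease before they give up and exit.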

Here are some relevant log entries:

gothub commented 2 years ago

A test was made on k8s dev to try to resolve the restart issue by increasing two etcd config parameters in /etc/kubernetes/manifests/etcd.yaml. Here is a snippet from that file; the last two lines were added, and they increase the heartbeat interval and the leader election timeout:

spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://128.111.85.190:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://128.111.85.190:2380
    - --initial-cluster=docker-dev-ucsb-1=https://128.111.85.190:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://128.111.85.190:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://128.111.85.190:2380
    - --name=docker-dev-ucsb-1
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --heartbeat-interval=1000
    - --election-timeout=10000
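
One way to confirm the restarted etcd pod actually picked up the new flags (a sketch):

kubectl -n kube-system describe pod etcd-docker-dev-ucsb-1 | grep -E 'heartbeat-interval|election-timeout'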

When this file is updated (e.g. via the vi editor), etcd is automatically restarted. This change was made yesterday afternoon and doesn't appear to have prevented the kube service restarts, as shown by a recent k8s status; the value in parentheses under the RESTARTS column is how long ago the last restart occurred:

kubectl get pods,services -n kube-system -o wide
NAME                                            READY   STATUS    RESTARTS         AGE    IP               NODE                NOMINATED NODE   READINESS GATES
pod/calico-kube-controllers-78d6f96c7b-sld68    1/1     Running   646              120d   192.168.3.135    docker-dev-ucsb-1   <none>           <none>
pod/calico-node-9jmnl                           1/1     Running   174 (26d ago)    196d   128.111.85.191   docker-dev-ucsb-2   <none>           <none>
pod/calico-node-fkvm8                           1/1     Running   89 (47h ago)     196d   128.111.85.190   docker-dev-ucsb-1   <none>           <none>
pod/coredns-78fcd69978-qktrm                    1/1     Running   2 (25d ago)      120d   192.168.3.166    docker-dev-ucsb-1   <none>           <none>
pod/coredns-78fcd69978-xjqjz                    1/1     Running   2 (25d ago)      120d   192.168.3.144    docker-dev-ucsb-1   <none>           <none>
pod/etcd-docker-dev-ucsb-1                      1/1     Running   0                24h    128.111.85.190   docker-dev-ucsb-1   <none>           <none>
pod/kube-apiserver-docker-dev-ucsb-1            1/1     Running   2 (4h21m ago)    47h    128.111.85.190   docker-dev-ucsb-1   <none>           <none>
pod/kube-controller-manager-docker-dev-ucsb-1   1/1     Running   40 (3h46m ago)   47h    128.111.85.190   docker-dev-ucsb-1   <none>           <none>
pod/kube-proxy-7hmdm                            1/1     Running   1 (78d ago)      120d   128.111.85.191   docker-dev-ucsb-2   <none>           <none>
pod/kube-proxy-x54mw                            1/1     Running   2 (25d ago)      120d   128.111.85.190   docker-dev-ucsb-1   <none>           <none>
pod/kube-scheduler-docker-dev-ucsb-1            1/1     Running   38 (3h46m ago)   47h    128.111.85.190   docker-dev-ucsb-1   <none>           <none>

amoeba commented 2 years ago

@gothub mentioned this on our weekly dev call today and I thought I'd drop in a data point from my single-node cluster running on another NCEAS VM:

mecum@neutral-cat:~$ k get -n kube-system  pods
NAME                                  READY   STATUS    RESTARTS          AGE
coredns-78fcd69978-ml7sl              1/1     Running   1 (3d3h ago)      38d
coredns-78fcd69978-xq7qt              1/1     Running   2                 38d
etcd-neutral-cat                      1/1     Running   6885 (2d7h ago)   38d
kube-apiserver-neutral-cat            1/1     Running   229               38d
kube-controller-manager-neutral-cat   1/1     Running   629 (4h4m ago)    38d
kube-proxy-dpvtb                      1/1     Running   1 (3d3h ago)      38d
kube-scheduler-neutral-cat            1/1     Running   631 (4h4m ago)    38d
weave-net-t6vtk                       2/2     Running   3 (3d2h ago)      38d

More restarts than seems remotely reasonable. I see a lot of warnings from etcd about deadlocks and timeouts, but none of them clearly explains a restart. Haven't made a serious effort at figuring it out, though.

mbjones commented 2 years ago

Similar problem from the interwebs: https://platform9.com/kb/kubernetes/excessive-kubernetes-master-pod-restarts
They say:

Cause

  • Etcd heartbeats are timing out, resulting in frequent leader elections.
  • The kube-controller-manager and kube-scheduler container logs show etcd read timeouts due to the leader elections, resulting in the restart of these containers.

Resolution

  • The default values of heartbeat-interval and election-timeout are 100ms and 1000ms, respectively. For Azure, we've had to increase these values to 1000ms and 10000ms and these defaults are included in Platform9 Managed Kubernetes v4.1+.

amoeba commented 2 years ago

Good find. If I grep my etcd logs for "error", I get:

WARNING: 2021/12/16 22:45:27 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
{"level":"warn","ts":"2021-12-16T22:45:50.129Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"2.000535152s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
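
The "apply request took too long ... context deadline exceeded" entries are the classic symptom of a slow disk behind etcd. A quick way to watch disk latency live while these warnings occur (a sketch, assuming the sysstat package is installed on the VM):

# extended device stats every 5s; watch await and %util for the device backing /var/lib/etcd
iostat -x 5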