coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

etcd member cannot start up because of liveness check failures #2109

Open armstrongli opened 5 years ago

armstrongli commented 5 years ago

We encountered etcd member startup failures recently in our Kubernetes cluster. The phenomenon is that the etcd members keep crash-looping.

The failure is caused by the disk running out of space: the etcd server container logs report that there is no space left on the disk. After checking the disk, it is indeed the case that all of the disk space is used by the etcd snap dir.

-bash-4.2# ls -lht
total 983G
-rw------- 1 root root 145M Aug  4 22:11 tmp158215125
-rw------- 1 root root 799M Aug  4 22:10 db
-rw------- 1 root root 256M Aug  4 22:04 tmp300815437
-rw------- 1 root root 405M Aug  4 22:01 tmp320617345
-rw------- 1 root root 255M Aug  4 21:53 tmp153885235
-rw------- 1 root root 288M Aug  4 21:49 tmp208855777
-rw------- 1 root root 268M Aug  4 21:41 tmp525631539
......

After checking the etcd server logs, the server appears to start up normally and then exit without any specific error. The container exit code is 137, i.e. 128 + 9 (SIGKILL), which means the container was killed by the container runtime. It was not an OOM kill, so the container was killed on purpose.
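
For reference, this is a quick way to confirm the exit code and kill reason from the API. It is only a sketch: the pod name and namespace are taken from the kubelet log below, and the etcd container is assumed to be the first container in the pod.

    # sketch: inspect the last termination state of the etcd container
    # pod/namespace come from the kubelet log below; container index 0 is an assumption
    kubectl -n foooo get pod etcd-0010 \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
    kubectl -n foooo get pod etcd-0010 \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'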

After checking the kubelet logs, it shows that the container is killed because of liveness probe failures:

Aug 04 22:50:44 foo.bar.com kubelet[24271]: I0804 22:50:44.137115   24271 prober.go:111] Liveness probe for "etcd-0010_foooo(f830b222-90d5-11e9-8966-74dbd1802f80):etcd" failed (failure): Error:  context deadline exceeded

After checking the etcd server logs again, there are entries about receiving snapshot data from the leader, but none about the snapshot being successfully loaded. This means the etcd container is killed before it is able to finish starting up.

...
2019-08-05 05:12:50.179645 I | rafthttp: receiving database snapshot [index:200843417, from 9df0abd4b7831a12] ...
...

The liveness probe of the etcd pod is:

    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://localhost:2379 --cert=/etc/etcdtls/operator/etcd-tls/etcd-client.crt
          --key=/etc/etcdtls/operator/etcd-tls/etcd-client.key --cacert=/etc/etcdtls/operator/etcd-tls/etcd-client-ca.crt
          get foo
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 60
      successThreshold: 1
      timeoutSeconds: 10

kubelet starts running the liveness probe 10 seconds after the container starts, each probe times out after 10 seconds, and 3 failures are allowed before the container is killed; that means 10 + 3 * 10 = 40s.

That is too short to transmit a snapshot from the leader to the member.
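
To illustrate the kind of mitigation I have in mind, a more forgiving probe would stop kubelet from killing the member while it is still receiving and applying the snapshot. The numbers below are illustrative assumptions only, not the values the operator ships:

    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://localhost:2379 --cert=/etc/etcdtls/operator/etcd-tls/etcd-client.crt
          --key=/etc/etcdtls/operator/etcd-tls/etcd-client.key --cacert=/etc/etcdtls/operator/etcd-tls/etcd-client-ca.crt
          get foo
      failureThreshold: 10       # tolerate more consecutive failures while the snapshot is applied
      initialDelaySeconds: 120   # give the member time to receive and load the snapshot before probing
      periodSeconds: 60
      successThreshold: 1
      timeoutSeconds: 10

This only buys the member more time; see the PR linked in the next comment for the proposed fix.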

armstrongli commented 5 years ago

Here is the PR to fix this issue: https://github.com/coreos/etcd-operator/pull/2108

eliaoggian commented 4 years ago

My problem was related to never compacting and defragging the DB. This caused the DB to grow huge (1.5 GB), and therefore the new pod failed to start in time because the readiness probe kept failing.

After compacting and defragging, the DB is 700 kB and the pods start with no issues.
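
This is roughly the maintenance sequence I mean, as a sketch only; the endpoint and TLS paths are copied from the liveness probe above, so adjust them for your own cluster:

    # reuse the endpoint and client certs from the liveness probe above (adjust as needed)
    export ETCDCTL_API=3
    ETCD="etcdctl --endpoints=https://localhost:2379 \
      --cert=/etc/etcdtls/operator/etcd-tls/etcd-client.crt \
      --key=/etc/etcdtls/operator/etcd-tls/etcd-client.key \
      --cacert=/etc/etcdtls/operator/etcd-tls/etcd-client-ca.crt"

    # get the current revision of the keyspace
    REV=$($ETCD endpoint status --write-out="json" | egrep -o '"revision":[0-9]+' | egrep -o '[0-9]+')

    # compact history up to that revision, then defragment to release the freed space
    # back to the filesystem (defrag briefly blocks the member it runs against)
    $ETCD compact $REV
    $ETCD defrag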

Maybe this helps.