Open armstrongli opened 5 years ago
Here is the PR to fix this issue: https://github.com/coreos/etcd-operator/pull/2108
My problem was related to never compacting and defragging the DB. This caused the DB to grow huge (1.5 GB), so the new pod failed to start in time and the readiness probe failed.
After compacting and defragging, the DB is 700 kB and the pods start with no issues.
Maybe this helps.
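For reference, the compaction/defrag steps above follow the etcd maintenance docs; here is a minimal sketch assuming etcd v3 and a single endpoint on `127.0.0.1:2379` (adjust the endpoint and any auth flags for your cluster):

```sh
# Find the current revision of the keyspace.
rev=$(ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 endpoint status \
  --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')

# Compact away all revisions older than the current one.
ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 compact "$rev"

# Defragment to release the freed space back to the filesystem.
ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 defrag
```

Note that `defrag` blocks the member while it rewrites the backend database, so run it against one member at a time.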
We encountered etcd member start failures in our Kubernetes cluster recently. The symptom is that the etcd members keep crash-looping.

The failures are caused by running out of disk space: the etcd server container logs show `no space left on the disk`. After checking the disk usage, it is true that all of the disk space is consumed by the etcd snap dir. After checking the logs of the etcd server, the server appears to start up normally and then exit without any specific error. The exit code of the container is `137`, which means the container exited because of a `kill` signal from the container runtime. The error is not OOM, so the container was killed on purpose. After checking the logs of the kubelet, it shows that the container is killed because of liveness check failures.

After checking the logs of the etcd server again, there are log entries about fetching snapshot data from the leader, but no log entry saying the snapshot was successfully loaded. It means that the etcd container is killed before it is able to finish starting up.
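For anyone reproducing this diagnosis, the checks above can be run roughly as follows; this is a sketch assuming the etcd data dir is `/var/etcd/data` and the pod is `etcd-0` in namespace `kube-system` (all three names are placeholders, substitute your own):

```sh
# How much space is the snap dir actually using?
du -sh /var/etcd/data/member/snap

# What exit code did the etcd container last terminate with? (137 = SIGKILL)
kubectl -n kube-system get pod etcd-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Why was it killed? Liveness probe failures show up in the events here.
kubectl -n kube-system describe pod etcd-0

# Did the member ever finish loading the snapshot from the leader?
kubectl -n kube-system logs etcd-0 --previous | grep -i snapshot
```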
The liveness check of the etcd Pod is configured as sketched below.
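The original config block is not shown here, so this is a reconstruction from the timings described in the next paragraph, assuming an exec-style probe (the probe command itself is a placeholder):

```yaml
livenessProbe:
  exec:
    command: ["/bin/sh", "-c", "ETCDCTL_API=3 etcdctl endpoint health"]  # placeholder command
  initialDelaySeconds: 10   # first probe 10s after the container starts
  periodSeconds: 10         # probe every 10s
  failureThreshold: 3       # 3 consecutive failures => container is killed
```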
The kubelet starts the liveness checks 10s after the container starts and allows 3 failures, 10s apart, before the container is killed. That gives the member at most 10 + 3 * 10 = 40s to become healthy.
40s is too short to transmit a snapshot from the leader to the member.
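A straightforward mitigation (my sketch only; the linked PR may implement the fix differently) is to relax the probe so a recovering member has longer to receive the snapshot, for example:

```yaml
livenessProbe:
  exec:
    command: ["/bin/sh", "-c", "ETCDCTL_API=3 etcdctl endpoint health"]  # placeholder command
  initialDelaySeconds: 60   # give a fresh member time to start fetching the snapshot
  periodSeconds: 10
  failureThreshold: 18      # 60 + 18 * 10 = 240s before the kubelet kills the container
```

A `startupProbe`, where the Kubernetes version supports it, is another way to cover a slow start without loosening the steady-state liveness check.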