Currently, if a node crashes or shuts down, its node status becomes unready (a failed state). Right after the node failure, the etcd pods on it are not deleted: their phase is still Running, while their status becomes unready or unknown.
After some timeout (default 5m), the etcd pods get evicted. Check the kube-controller-manager flag:
--pod-eviction-timeout: https://kubernetes.io/docs/reference/generated/kube-controller-manager/
There are two problems here: a node could have restarted before eviction kicks in, and node unready status is not a good indication of etcd pod health.
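To make that gap concrete, here is a minimal sketch (using client-go) that lists the etcd pods and prints their phase next to their Ready condition; right after a node failure the phase typically still reads Running while the condition is no longer True. The `app=etcd` label selector and the `default` namespace are assumptions and will differ per deployment.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// podReady reports whether the pod's Ready condition is True. After a node
// failure the pod's Phase usually stays Running while this condition flips
// to False or Unknown, which is the mismatch described above.
func podReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List the etcd pods; the label selector is an assumption.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=etcd"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s phase=%s ready=%v\n", p.Name, p.Status.Phase, podReady(&p))
	}
}
```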
Node restart
After a node restart, the etcd pod would simply die, because it uses emptyDir to store its data and emptyDir is not persistent in this case. See issue https://github.com/coreos/etcd-operator/issues/1839.
We could solve this by storing the data on a persistent volume. Check:
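As an illustration of that direction, a minimal sketch of backing the etcd data dir with a PersistentVolumeClaim instead of emptyDir is shown below. The claim name, size, and the `/var/etcd` mount path are assumptions, and the field names target a recent k8s.io/api (older releases use `ResourceRequirements` for the claim's resources).

```go
package etcdpod

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// etcdDataVolume returns a PersistentVolumeClaim for the etcd data dir plus
// the matching volume and mount, replacing the emptyDir volume so the data
// survives a node restart. Names, size, and mount path are illustrative.
func etcdDataVolume(memberName string) (corev1.PersistentVolumeClaim, corev1.Volume, corev1.VolumeMount) {
	pvc := corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: memberName + "-data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("8Gi"),
				},
			},
		},
	}
	vol := corev1.Volume{
		Name: "etcd-data",
		VolumeSource: corev1.VolumeSource{
			PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
				ClaimName: pvc.Name,
			},
		},
	}
	mount := corev1.VolumeMount{Name: "etcd-data", MountPath: "/var/etcd"}
	return pvc, vol, mount
}
```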
Fine-grained health checking and timeout
Even though node unreadiness might indicate that the etcd pods on it are unhealthy, it is not fine-grained enough: the kubelet could have a bug while the etcd process is fine, the node could be fine while the etcd process has crashed, and so on.
For the above reasons, we need application-level health checking to detect unhealthy etcd processes, plus custom toleration policies for unhealthy members.
A simple readiness probe has been added in https://github.com/coreos/etcd-operator/issues/1320, which could be used as a health check.
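For reference, a hedged sketch of what such a readiness probe on the etcd container could look like, assuming the member serves plain HTTP on its client port (2379) so `/health` is reachable without TLS; the actual probe from that issue may use a different mechanism or thresholds, and older k8s.io/api releases use `Handler` instead of `ProbeHandler`.

```go
package etcdpod

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// etcdReadinessProbe returns a readiness probe that hits the member's own
// /health endpoint on the client port, so pod readiness tracks etcd health
// rather than node status. Timings and thresholds are illustrative.
func etcdReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/health",
				Port: intstr.FromInt(2379),
			},
		},
		InitialDelaySeconds: 10,
		PeriodSeconds:       10,
		FailureThreshold:    3,
	}
}
```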
We will need custom toleration policies covering how long an unhealthy etcd member can be tolerated and which failure cases can be tolerated at all (e.g. data corruption cannot). After the toleration period, etcd-operator would replace unhealthy members onto other nodes.
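Nothing like this exists in etcd-operator today; the sketch below only illustrates the shape such a toleration policy could take, combining a timeout with a list of failure kinds that are never tolerated. All names (`TolerationPolicy`, `FailureKind`, `ShouldReplace`) are hypothetical.

```go
package etcdhealth

import "time"

// TolerationPolicy is a hypothetical knob set for how the operator reacts to
// an unhealthy member; none of these fields exist in etcd-operator today.
type TolerationPolicy struct {
	// How long a member may stay unhealthy before it gets replaced.
	UnhealthyTimeout time.Duration
	// Failure kinds that should never be tolerated (e.g. data corruption).
	ReplaceImmediatelyOn []FailureKind
}

// FailureKind names a class of member failure detected by health checking.
type FailureKind string

const (
	FailureUnreachable    FailureKind = "Unreachable"
	FailureDataCorruption FailureKind = "DataCorruption"
)

// ShouldReplace decides whether a member that first failed at firstFailure
// with the given kind of failure should be replaced onto another node now.
func (p TolerationPolicy) ShouldReplace(kind FailureKind, firstFailure, now time.Time) bool {
	for _, k := range p.ReplaceImmediatelyOn {
		if k == kind {
			return true // e.g. data corruption is never tolerated
		}
	}
	return now.Sub(firstFailure) >= p.UnhealthyTimeout
}
```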