After a short etcd blip, fleet has issues with its agent and engine, but the process remains up. This affects v0.13.0.
The symptoms are as follows: the Monitor detects that the server failed its heartbeat and asks all components to shut down, but the shutdown never completes. As a result, most components are dead while the server process is still up, and it serves:
{"error":{"code":503,"message":"fleet server unable to communicate with etcd"}}
The full error log is here:
Dec 07 19:28:55 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:221: Engine leadership lost, renewal failed: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:120: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:73: Unable to determine agent's desired state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:236: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR reconciler.go:59: Failed getting current cluster state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: WARN engine.go:117: Engine completed reconciliation in 4.004575849s
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:120: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:73: Unable to determine agent's desired state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:02 eu2-prod-core-hasu fleetd[3250]: ERROR server.go:237: Server monitor triggered: Monitor timed out before successful heartbeat
Dec 07 19:29:03 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:221: Engine leadership lost, renewal failed: client: etcd cluster is unavailable or misconfigured
Dec 07 19:30:02 eu2-prod-core-hasu fleetd[3250]: ERROR server.go:248: Timed out waiting for server to shut down
The curious bit is this code: https://github.com/coreos/fleet/blob/v0.13.0/server/server.go#L248
I think that after `Timed out waiting for server to shut down` the server should just crash immediately.