coreos / fleet

fleet ties together systemd and etcd into a distributed init system

fleet: server monitor fails to shutdown process #1716

Closed · mwitkow closed this issue 7 years ago

mwitkow commented 7 years ago

After a short etcd blip, fleet runs into trouble on both its agent and engine, but the fleetd process stays up. This affects v0.13.0.

The symptoms are as follows: the Monitor detects that the server failed its heartbeat and asks all components to shut down, but the shutdown never completes. Most components end up dead while the server process is still up, yet it keeps serving: `{"error":{"code":503,"message":"fleet server unable to communicate with etcd"}}`
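
For what it's worth, the stuck state is at least easy to detect from the outside. Here is a minimal, standalone probe sketch that treats a persistent 503 from the fleet API as "restart fleetd"; the unix socket path and the API route are assumptions about a typical fleet setup, not something taken from this issue:

package main

import (
    "context"
    "fmt"
    "net"
    "net/http"
    "os"
    "time"
)

func main() {
    // Assumed location of fleet's API unix socket; adjust to however the API
    // is actually exposed in your deployment.
    const sock = "/var/run/fleet.sock"

    client := &http.Client{
        Timeout: 5 * time.Second,
        Transport: &http.Transport{
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                return (&net.Dialer{}).DialContext(ctx, "unix", sock)
            },
        },
    }

    // Any API route works as a liveness check; /fleet/v1/units is assumed here.
    resp, err := client.Get("http://localhost/fleet/v1/units")
    if err != nil {
        fmt.Fprintln(os.Stderr, "probe failed:", err)
        os.Exit(2)
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusServiceUnavailable {
        // fleetd is up but cannot reach etcd: the stuck state described above.
        fmt.Fprintln(os.Stderr, "fleetd responding with 503, restart required")
        os.Exit(1)
    }
    fmt.Println("fleetd healthy:", resp.Status)
}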

The full error log is here:

Dec 07 19:28:55 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:221: Engine leadership lost, renewal failed: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:120: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:73: Unable to determine agent's desired state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:236: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR reconciler.go:59: Failed getting current cluster state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: WARN engine.go:117: Engine completed reconciliation in 4.004575849s
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:120: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:73: Unable to determine agent's desired state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:02 eu2-prod-core-hasu fleetd[3250]: ERROR server.go:237: Server monitor triggered: Monitor timed out before successful heartbeat
Dec 07 19:29:03 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:221: Engine leadership lost, renewal failed: client: etcd cluster is unavailable or misconfigured
Dec 07 19:30:02 eu2-prod-core-hasu fleetd[3250]: ERROR server.go:248: Timed out waiting for server to shut down

The curious bit is this code: https://github.com/coreos/fleet/blob/v0.13.0/server/server.go#L248

func (s *Server) Supervise() {
    // Monitor blocks until either the heartbeat fails or a shutdown is
    // requested via killc; sd reports whether this was a deliberate shutdown.
    sd, err := s.mon.Monitor(s.hrt, s.killc)
    if sd {
        log.Infof("Server monitor triggered: told to shut down")
    } else {
        log.Errorf("Server monitor triggered: %v", err)
    }
    // Signal every component to stop, then wait for them to drain.
    close(s.stopc)
    done := make(chan struct{})
    go func() {
        s.wg.Wait()
        close(done)
    }()
    select {
    case <-done:
    case <-time.After(shutdownTimeout):
        // Components failed to stop in time. Setting sd here skips the
        // restart branch below, so Supervise simply returns and the process
        // stays up with its components dead.
        log.Errorf("Timed out waiting for server to shut down")
        sd = true
    }
    if !sd {
        log.Infof("Restarting server")
        s.SetRestartServer(true)
        s.Run()
        s.SetRestartServer(false)
    }
}
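
For context, here is a standalone sketch of the shutdown pattern Supervise relies on (illustrative only, not fleet source; the names are made up). Each component registers with the WaitGroup and is expected to return once stopc is closed; if a component is stuck inside a call that ignores stopc and has no deadline (for example a hanging etcd request), it never calls Done() and wg.Wait() blocks past shutdownTimeout, which matches the log above:

package main

import (
    "log"
    "sync"
    "time"
)

// runComponent starts a component goroutine that loops until stopc is closed.
func runComponent(name string, stopc <-chan struct{}, wg *sync.WaitGroup, work func() error) {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for {
            select {
            case <-stopc:
                log.Printf("%s: stop requested, exiting", name)
                return
            default:
            }
            // If work() blocks forever (no timeout on the underlying call),
            // this goroutine never observes stopc and never calls Done().
            if err := work(); err != nil {
                log.Printf("%s: %v", name, err)
            }
            time.Sleep(time.Second)
        }
    }()
}

func main() {
    var wg sync.WaitGroup
    stopc := make(chan struct{})

    runComponent("agent", stopc, &wg, func() error {
        time.Sleep(100 * time.Millisecond) // placeholder for a bounded etcd call
        return nil
    })

    close(stopc) // what Supervise does via close(s.stopc)

    done := make(chan struct{})
    go func() {
        wg.Wait()
        close(done)
    }()
    select {
    case <-done:
        log.Print("all components stopped")
    case <-time.After(5 * time.Second): // stand-in for shutdownTimeout
        log.Print("timed out waiting for components to stop")
    }
}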

I think that after `Timed out waiting for server to shut down` the server should just crash immediately, instead of returning from Supervise and leaving a half-dead process behind.
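
To illustrate the suggestion (this is only a sketch, not the actual change made in the PR referenced below): the timeout branch could terminate the process and rely on the init system to restart fleetd, for example via a helper like this hypothetical waitOrExit:

package main

import (
    "log"
    "os"
    "sync"
    "time"
)

// waitOrExit waits for wg up to timeout; if the components never finish, it
// terminates the process instead of returning, so that systemd (assuming a
// Restart= policy on fleet.service) can bring fleetd back up cleanly.
func waitOrExit(wg *sync.WaitGroup, timeout time.Duration) {
    done := make(chan struct{})
    go func() {
        wg.Wait()
        close(done)
    }()
    select {
    case <-done:
    case <-time.After(timeout):
        log.Print("Timed out waiting for server to shut down, exiting")
        os.Exit(1)
    }
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1) // simulate a component that never calls Done()
    waitOrExit(&wg, 2*time.Second)
}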

dongsupark commented 7 years ago

Closed via https://github.com/coreos/fleet/pull/1717