docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm which is at https://github.com/docker/swarmkit
Apache License 2.0
5.75k stars 1.08k forks source link

Re-check containers after receiving from resumeCh #2882

Closed eklitzke closed 6 years ago

eklitzke commented 6 years ago

I'm trying to fix an issue that I believe is related to PR #2860. In production I have a swarm of four docker nodes, each running a large number of containers. Restarting docker on one of the nodes can take a while (often more than a minute). Sometimes this causes swarm to stop sending new requests to that worker. I'm not sure if the swarm process would ever re-schedule work to that node, typically when this happens I've restarted the swarm process to force the swarm manager to get back into a good state. When this error happens, docker info on the swarm manager indicates that swarm thinks the host has containers scheduled to it, when in fact the host has no containers scheduled. When this happens I see a log entry with the text "Engine came back to life..." which was added in PR #2860. My guess is this happens 1 in 10 times that I restart the docker daemon on a node.

From my non-expert look at the code, I think the issue is that the return statement in the current code is causing e.RefreshContainers() not to be run a few lines below. I've tested this change in production and the problem I was experiencing doesn't seem to recur: but since this issue was only happening sporadically even before, this is inconclusive. Does this change look right?

eklitzke commented 6 years ago

I just realized this is a no-op due to the fact that Go doesn't have implicit fallthrough. We are still experiencing a problem related to this code path, however.