etienne-napoleone closed this 5 years ago
Still happening. It seems the node receives the "suspect" message but cannot connect back to the manager when it tries to refute it.
time="2019-01-14T04:04:49.676462366Z" level=warning msg="memberlist: Refuting a suspect message (from: $MANAGER_NODE_ID)"
time="2019-01-14T04:04:49.930428901Z" level=warning msg="memberlist: Failed fallback ping: read tcp $NODE_IP:49290->$MANAGER_IP:7946: i/o timeout"
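To get a feel for how often this happens, the two warnings above can be counted in the daemon logs. This is only a rough sketch: `count_memberlist_warnings` is a made-up helper name, and the `journalctl` usage line assumes a systemd host where the daemon runs as `docker.service`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts the two memberlist warnings quoted above
# in whatever log text is piped in on stdin.
count_memberlist_warnings() {
  # grep -c prints the number of matching lines; "|| true" keeps a
  # count of zero from being reported as a failure exit status.
  grep -cE 'memberlist: (Refuting a suspect message|Failed fallback ping)' || true
}

# Usage (assumption: systemd host, daemon unit named docker.service):
#   journalctl -u docker.service --since "1 hour ago" | count_memberlist_warnings
```

A count that climbs on one node while staying near zero on the others would point at that node's network path rather than at Swarm itself.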
As the other machines have no problems, I'll simply try rebooting the machine. In the meantime, I'm checking with DO.
Still checking with DO. It seems packet loss on the local network can reach 10-15%.
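For anyone who wants to check loss between a node and the manager themselves, a quick sketch using `ping`. `packet_loss_percent` is a made-up helper name, `$MANAGER_IP` is the same placeholder as in the logs above, and the parsing assumes the usual `N% packet loss` summary line that `ping` prints on Linux and BSD.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extracts the loss percentage from a ping summary
# read on stdin (expects a line containing "N% packet loss").
packet_loss_percent() {
  grep -oE '[0-9.]+% packet loss' | grep -oE '^[0-9.]+'
}

# Usage (assumption: MANAGER_IP is set to the manager's private address):
#   ping -c 100 -q "$MANAGER_IP" | packet_loss_percent
```

ICMP loss is only a proxy here, since the gossip traffic in the logs goes over port 7946, but sustained loss on the private network would be consistent with the timeouts.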
They did network maintenance a few nights ago. The node has not been kicked out of the swarm since then, so it seems to be all right now.
Same issue is still happening. Not sure if it's a DO or Swarm problem.
Found this: https://github.com/moby/moby/issues/32195#issuecomment-305457491 But it's from a year ago!
I upgraded the managers to 2 GB of RAM, in case it helps. There have been several RAM usage alarms on manager01, but they seem to fire after the node is marked as unhealthy, as a consequence of the work needed to reschedule the containers.
Seems like it resolved the problem! :tada:
As seen in the Docker daemon logs, healthy nodes sometimes get tagged as unhealthy if there are network connectivity issues for a few seconds (see https://github.com/moby/moby/issues/36311).
Possible solutions:
docker swarm update --dispatcher-heartbeat 15s