Nodes are being marked as unhealthy

BuildOnViction / infrastructure

🏗 TomoChain internal infrastructure


Nodes are being marked as unhealthy #126

Closed etienne-napoleone closed 5 years ago

etienne-napoleone commented 5 years ago

As seen in the Docker daemon logs, healthy nodes are sometimes tagged as unhealthy when there are network connectivity issues for a few seconds. See https://github.com/moby/moby/issues/36311. Possible solutions:

[x] increase node heartbeat timeout (docker swarm update --dispatcher-heartbeat 15s)
[x] if still present, check with DO about local network connectivity
[ ] if still present, switch to external network swarm
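A minimal sketch of how the first step could be scripted, assuming it runs on a manager node with the docker CLI on the PATH. The helper names (`heartbeat_update_cmd`, `apply_heartbeat`) and the duration check are hypothetical; only the `docker swarm update --dispatcher-heartbeat` command itself comes from the list above.

```python
import re
import subprocess

# Go-style duration such as "15s" or "1m30s" (assumed to be what the flag accepts).
_DURATION = re.compile(r"^(\d+(\.\d+)?(ns|us|ms|s|m|h))+$")


def heartbeat_update_cmd(period):
    """Build the argv for raising the swarm dispatcher heartbeat period."""
    if not _DURATION.match(period):
        raise ValueError(f"not a duration: {period!r}")
    return ["docker", "swarm", "update", "--dispatcher-heartbeat", period]


def apply_heartbeat(period, dry_run=True):
    """Print the command (dry run) or execute it on the local manager node."""
    cmd = heartbeat_update_cmd(period)
    if dry_run:
        print(" ".join(cmd))
        return cmd
    # Raises CalledProcessError if docker exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    apply_heartbeat("15s")  # dry run: only prints the command
```

Keeping the dry run as the default makes it harder to change the swarm settings by accident.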

etienne-napoleone commented 5 years ago

Still happening. It seems the node receives the "suspect" probe but cannot connect back when it tries to refute it.

time="2019-01-14T04:04:49.676462366Z" level=warning msg="memberlist: Refuting a suspect message (from: $MANAGER_NODE_ID)"
time="2019-01-14T04:04:49.930428901Z" level=warning msg="memberlist: Failed fallback ping: read tcp $NODE_IP:49290->$MANAGER_IP:7946: i/o timeout"
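To see how often this happens, the daemon log could be scanned for these two memberlist warnings. The following is a rough sketch, assuming the log lines keep the `key="value"` layout shown above; the function names and category labels are made up for illustration.

```python
import re
from collections import Counter

# Matches key=value pairs where the value is either double-quoted or a bare token.
_PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')


def parse_line(line):
    """Split one daemon log line into a dict of its key=value fields."""
    fields = {}
    for key, quoted, bare in _PAIR.findall(line):
        fields[key] = quoted.replace('\\"', '"') if quoted else bare
    return fields


def classify(msg):
    """Map a memberlist message to a short label, or None if it is not of interest."""
    if "Refuting a suspect message" in msg:
        return "refuted_suspect"
    if "Failed fallback ping" in msg:
        return "failed_fallback_ping"
    return None


def summarize(lines):
    """Count the memberlist warnings of interest in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        msg = parse_line(line).get("msg", "")
        if not msg.startswith("memberlist:"):
            continue
        kind = classify(msg)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Feeding it the lines of the daemon log (for example read from a file or piped in from the journal) gives a per-category count, which can then be compared across nodes.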

As the other machines have no problems, I'll simply try rebooting the machine. In the meantime, checking with DO.

etienne-napoleone commented 5 years ago

Currently checking with DO; it seems packet loss on the local network can go up to 10-15%.
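For checking the loss figure independently, something like the sketch below could parse the summary line of a ping run between two nodes. It assumes the Linux iputils `ping` summary format ("N packets transmitted, M received, ..."); the function names are hypothetical and the sample line in the usage part is invented for illustration, not a measurement from this cluster.

```python
import re

# Summary line printed by iputils ping, e.g.
# "20 packets transmitted, 17 received, 15% packet loss, time 19032ms"
_SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_ping_summary(output):
    """Return (sent, received) from ping output, or raise if no summary is found."""
    match = _SUMMARY.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    return int(match.group(1)), int(match.group(2))


def packet_loss_percent(sent, received):
    """Percentage of probes that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


if __name__ == "__main__":
    # Invented sample summary, only to show the call sequence.
    sample = "20 packets transmitted, 17 received, 15% packet loss, time 19032ms"
    sent, received = parse_ping_summary(sample)
    print(f"{packet_loss_percent(sent, received):.1f}% loss")
```

Running ping over the private interface between a worker and a manager for a few minutes, then feeding the output to `parse_ping_summary`, would give a number to compare against the 10-15% range.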

etienne-napoleone commented 5 years ago

They did network maintenance a few nights ago. The node has not been kicked out of the swarm since then. Seems like it's all right now.

etienne-napoleone commented 5 years ago

Same issue is still happening. Not sure if it's a DO or Swarm problem.

Found this: https://github.com/moby/moby/issues/32195#issuecomment-305457491 But it's from 1y ago!

etienne-napoleone commented 5 years ago

I upgraded the managers to 2GB RAM, in case it helps. There have been several RAM usage alarms on manager01, but they seem to occur after the node is marked as unhealthy, as a consequence of the work needed to reschedule the containers.

etienne-napoleone commented 5 years ago

Seems like it resolved the problem! :tada: