Closed by cpswan 2 years ago
All nodes are now running:
sudo docker run -d --restart unless-stopped reg.swarm0001.atsign.zone/atsigncompany/at_swarm_load
(or from staging0001 registry where applicable)
rather than a docker service.
The at_swarm_load Docker image has also been updated to use the latest Python base image, and automation has been added to track upstream dependencies and rebuild the image when they change.
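For anyone repeating this on another cluster, here is a rough sketch of how the command above might be rolled out to a list of nodes. The `NODES` variable, the `DRY_RUN` switch, the helper function names, and the use of ssh are all assumptions for illustration and not how it was actually done here; only the image reference and `docker run` flags come from the command above. Because the container is started with `docker run` rather than as a swarm task, draining the node should not stop it.

```shell
#!/bin/sh
# Sketch only: start the monitoring container directly on each node,
# outside swarm scheduling. NODES, DRY_RUN and ssh access are assumptions.

# Image reference taken from the comment above.
IMAGE="reg.swarm0001.atsign.zone/atsigncompany/at_swarm_load"

# Print the docker command for a given image so it can be reviewed first.
monitor_cmd() {
  printf 'sudo docker run -d --restart unless-stopped %s\n' "$1"
}

# With DRY_RUN=1 (the default) only print what would run on the node;
# with DRY_RUN=0 run the command on the node over ssh.
start_monitor() {
  node="$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s: %s\n' "$node" "$(monitor_cmd "$IMAGE")"
  else
    ssh "$node" "$(monitor_cmd "$IMAGE")"
  fi
}

# NODES is a space-separated list of hostnames (hypothetical).
for node in ${NODES:-}; do
  start_monitor "$node"
done
```

Running it as `NODES="node1 node2" sh start_monitor.sh` (hypothetical hostnames and file name) prints one line per node; setting `DRY_RUN=0` would execute the command instead. Printing first is a deliberate choice so the exact command can be checked before it touches a production node.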
Describe the bug
At present the GCP custom monitoring container runs as a global swarm service to ensure that there is one container on every node.
This works fine when everything is working properly, but means that the monitoring container isn't running when a node is drained.
We've also found that it can take substantial time for the monitoring container to (re)start when a node is brought back to active, particularly if the manager is busy (as can happen when rebalancing a cluster).
Expected behavior
We should have monitoring in place at all times.