atsign-foundation / at_server

The software implementation of Atsign's core technology
https://docs.atsign.com
BSD 3-Clause "New" or "Revised" License

GCP custom monitoring container shouldn't be run as a Swarm service #651

Closed: cpswan closed this issue 2 years ago

cpswan commented 2 years ago

Describe the bug

At present the GCP custom monitoring container runs as a global swarm service to ensure that there is one container on every node.

This works while every node is active, but Swarm stops service tasks (including those of global services) on a drained node, so the monitoring container isn't running while a node is drained.

We've also found that it can take substantial time for the monitoring container to (re)start when a node is brought back to active, particularly if the manager is busy (as can happen when rebalancing a cluster).
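As an illustration of the gap described above, the nodes that currently have no monitoring under a global service are the drained ones. The following is a minimal sketch, not part of the original setup: the helper name `drained_nodes` is hypothetical, and it assumes input lines of the form `hostname availability`, as `docker node ls --format '{{.Hostname}} {{.Availability}}'` should produce on a manager.

```shell
#!/bin/sh
# Hypothetical helper (not from the original issue): read
# "hostname availability" lines on stdin and print the hostnames of
# drained nodes, i.e. nodes where a global Swarm service has no task.
drained_nodes() {
    # Compare case-insensitively so both "Drain" and "drain" match.
    awk 'tolower($2) == "drain" { print $1 }'
}

# Example usage on a manager node (assumes access to the Docker daemon):
#   docker node ls --format '{{.Hostname}} {{.Availability}}' | drained_nodes
```

Paused nodes are deliberately not reported, since pausing only blocks new tasks and leaves existing ones running.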

Expected behavior

We should have monitoring in place at all times.

cpswan commented 2 years ago

All nodes are now running:

sudo docker run -d --restart unless-stopped reg.swarm0001.atsign.zone/atsigncompany/at_swarm_load

(or from staging0001 registry where applicable)

rather than a docker service.
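For anyone scripting the same change, the command above could be wrapped so that it is safe to re-run on a node. This is only a sketch under assumptions: the function name `ensure_monitor` and the `DOCKER` override variable are hypothetical, and the caller is assumed to have permission to talk to the Docker daemon (the original command uses `sudo`).

```shell
#!/bin/sh
# Sketch (assumption, not the project's actual tooling): start the
# monitoring container on this node only if no container created from
# the image is already running.
IMAGE="reg.swarm0001.atsign.zone/atsigncompany/at_swarm_load"

ensure_monitor() {
    # DOCKER is a hypothetical override, e.g. for a wrapper or a test stub.
    docker_cmd="${DOCKER:-docker}"
    # List IDs of running containers whose image matches IMAGE.
    running=$("$docker_cmd" ps --quiet --filter "ancestor=$IMAGE")
    if [ -n "$running" ]; then
        echo "already running: $running"
    else
        # Same flags as the command in the comment above.
        "$docker_cmd" run -d --restart unless-stopped "$IMAGE"
    fi
}
```

Because the container is started with `--restart unless-stopped`, the Docker daemon on each node restarts it independently of the Swarm manager, which is the point of moving away from a service. Note that the `ancestor` filter matches the image the name currently resolves to, so a container started from an older pull of the image may not be detected.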

The at_swarm_load Docker image has also been updated to use the latest Python base image, and automation has been added to track upstream dependencies and rebuild the image when they change.