Currently, `update-operator` reboots nodes as soon as updates are available. https://github.com/coreos/container-linux-update-operator/issues/82 tracks adding support for a user-configured maintenance window. Even inside a maintenance window, though, there can be situations where reboots should be temporarily paused (e.g. during a critical or unplanned outage).
This can currently be done by setting a `reboot-paused` annotation on specific nodes; however, this is a manual operation and doesn't scale well cluster-wide.
It would be nice to let CLUO know about any existing AlertManager in the cluster and check for specific active alerts before proceeding. @brancz suggested that we could:
- take a ConfigMap listing the critical alerts that should pause reboots cluster-wide (and inotify-watch it for hot-reloading)
- reach the AM on its in-cluster public read-only endpoint and check for non-silenced critical alerts before setting `reboot-ok`
For clarity, this should be completely orthogonal to maintenance window configuration.
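
As a rough illustration of the alert check suggested above, here is a minimal Go sketch. It assumes the Alertmanager v2 HTTP API (`GET /api/v2/alerts` with the `active`, `silenced` and `inhibited` filters) is reachable in-cluster. The names `fetchActiveAlerts` and `blockingAlerts` and the alert names `EtcdDown` and `APIServerDown` are hypothetical, and the `pauseOn` set stands in for whatever the ConfigMap would contain. A local test server plays the role of Alertmanager so the example runs on its own; this is not existing CLUO code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// alert mirrors only the field of an Alertmanager alert that this sketch needs.
type alert struct {
	Labels map[string]string `json:"labels"`
}

// fetchActiveAlerts asks Alertmanager (v2 API assumed) for alerts that are
// active and neither silenced nor inhibited.
func fetchActiveAlerts(client *http.Client, baseURL string) ([]alert, error) {
	q := url.Values{}
	q.Set("active", "true")
	q.Set("silenced", "false")
	q.Set("inhibited", "false")
	resp, err := client.Get(baseURL + "/api/v2/alerts?" + q.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("alertmanager returned %s", resp.Status)
	}
	var alerts []alert
	if err := json.NewDecoder(resp.Body).Decode(&alerts); err != nil {
		return nil, err
	}
	return alerts, nil
}

// blockingAlerts returns the names of active alerts that appear in pauseOn,
// the (hypothetical) set of alert names loaded from the ConfigMap.
func blockingAlerts(active []alert, pauseOn map[string]bool) []string {
	var names []string
	for _, a := range active {
		if name := a.Labels["alertname"]; pauseOn[name] {
			names = append(names, name)
		}
	}
	return names
}

func main() {
	// Stand-in for Alertmanager so the sketch is runnable without a cluster.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `[{"labels":{"alertname":"EtcdDown","severity":"critical"}},`+
			`{"labels":{"alertname":"DiskFillingUp","severity":"warning"}}]`)
	}))
	defer srv.Close()

	// Hypothetical ConfigMap contents: alerts that pause reboots cluster-wide.
	pauseOn := map[string]bool{"EtcdDown": true, "APIServerDown": true}

	client := &http.Client{Timeout: 5 * time.Second}
	alerts, err := fetchActiveAlerts(client, srv.URL)
	if err != nil {
		// Assumption: fail closed and hold reboots if Alertmanager is unreachable.
		fmt.Println("cannot query Alertmanager, holding reboots:", err)
		return
	}
	if blocking := blockingAlerts(alerts, pauseOn); len(blocking) > 0 {
		fmt.Println("reboots paused by:", blocking)
		return
	}
	fmt.Println("no blocking alerts, reboot-ok may be set")
}
```

Whether an unreachable Alertmanager should hold reboots (as sketched here) or let them proceed is a design decision the proposal leaves open.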