bottlerocket-os / bottlerocket-update-operator

A Kubernetes operator for automated updates to Bottlerocket

Add delay and smarter verification between node restarts #12

Open jahkeup opened 4 years ago

jahkeup commented 4 years ago

What I'd like:

Dogswatch should add some delay between Node restarts in a cluster. During this time, the Controller should check in with the Node that was just updated to confirm that it has come back healthy and that Workloads have returned to it. After that, the Controller should wait a configurable duration before restarting the next Node.
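For illustration only, here's a minimal sketch of what that controller-side wait could look like. The names (`waitSettled`, `RolloutConfig`, `RestartDelay`) are made up for the example and are not anything dogswatch has today:

```go
// Sketch: after a Node reboots, wait for it to report Ready, then hold for a
// configurable settle period before the Controller moves on to the next Node.
package controller

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type RolloutConfig struct {
	ReadyTimeout time.Duration // how long to wait for the Node to become Ready again
	RestartDelay time.Duration // extra settle time between Node restarts
}

func waitSettled(ctx context.Context, client kubernetes.Interface, nodeName string, cfg RolloutConfig) error {
	deadline := time.Now().Add(cfg.ReadyTimeout)
	for {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err == nil && nodeIsReady(node) {
			break
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("node %s did not become Ready within %s", nodeName, cfg.ReadyTimeout)
		}
		time.Sleep(10 * time.Second)
	}
	// Configurable delay before the Controller considers the next Node.
	time.Sleep(cfg.RestartDelay)
	return nil
}

func nodeIsReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```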

samuelkarp commented 4 years ago

This seems potentially related to before and after reboot checks.

anguslees commented 4 years ago

Indeed. In case you want concrete use cases for before/after reboot checks, I use them (with the coreos/flatcar update operator) to delay until the rook/ceph cluster is healthy[1] and to signal to rook/ceph that it should set the "noout" flag[2]. The after-reboot check clears the noout flag and again blocks until the cluster is healthy.

[1] e.g. data is sufficiently replicated. This signal is "global" and much more complex than anything a single pod readinessProbe can represent, which is why it can't be just a PodDisruptionBudget. A better implementation might only consider the redundancy of the data on "this" node. In particular, a naive time delay, or a generic check that pods were running again (as suggested in the issue description), would not be sufficient here.

[2] noout means the node outage that is about to happen is expected to be brief, and rook should not start frantically re-replicating "lost" data onto new nodes.

This wasn't my idea at all; the standard rook docs for this are: https://rook.io/docs/rook/v1.4/container-linux.html
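For anyone curious, here's a rough Go rendering of what those two hooks amount to. The rook docs above implement this with shell scripts; the function names here are illustrative, and it assumes the ceph CLI is available in the hook's container:

```go
// Before-reboot: block until ceph reports HEALTH_OK, then set noout so ceph
// treats the upcoming outage as brief. After-reboot: clear noout and block
// until the cluster is healthy again.
package rebootcheck

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func cephHealth() (string, error) {
	out, err := exec.Command("ceph", "health").Output()
	return strings.TrimSpace(string(out)), err
}

func waitHealthy(timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		health, err := cephHealth()
		if err == nil && strings.HasPrefix(health, "HEALTH_OK") {
			return nil
		}
		time.Sleep(30 * time.Second)
	}
	return fmt.Errorf("ceph did not reach HEALTH_OK within %s", timeout)
}

// BeforeReboot blocks until data is fully replicated, then tells ceph the
// coming outage is expected to be brief.
func BeforeReboot() error {
	if err := waitHealthy(time.Hour); err != nil {
		return err
	}
	return exec.Command("ceph", "osd", "set", "noout").Run()
}

// AfterReboot clears the flag and blocks until the cluster is healthy again.
func AfterReboot() error {
	if err := exec.Command("ceph", "osd", "unset", "noout").Run(); err != nil {
		return err
	}
	return waitHealthy(time.Hour)
}
```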

I've used this for a long time now and it works great. What might not be obvious at first is that the reboot script itself is deployed as a daemonset limited to nodes with the "before-reboot" label. That means it automatically "finds" and installs itself only on the relevant nodes, and only for the relevant time, which is pretty neat. Debugging the system when updates are not proceeding does require an understanding of the various state machine interactions though, of course.
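As an illustration of that wiring, a client-go style sketch of such a hook DaemonSet might look like the following. The names and image are made up, and the nodeSelector label is the coreos/flatcar update-operator's before-reboot convention as I understand it:

```go
// Illustrative definition of a before-reboot hook DaemonSet that only lands
// on nodes the update operator has labelled as waiting for their checks.
package rebootcheck

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func beforeRebootDaemonSet() *appsv1.DaemonSet {
	labels := map[string]string{"app": "ceph-before-reboot-check"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "ceph-before-reboot-check", Namespace: "rook-ceph"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Only schedule onto nodes currently marked for the
					// before-reboot phase by the update operator.
					NodeSelector: map[string]string{
						"container-linux-update.v1.coreos.com/before-reboot": "true",
					},
					Containers: []corev1.Container{{
						Name:    "check",
						Image:   "rook/ceph:v1.4.0", // placeholder image with the ceph CLI
						Command: []string{"/scripts/before-reboot.sh"},
					}},
				},
			},
		},
	}
}
```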

I would expect very similar challenges exist for something like an elasticsearch cluster, where data replication is important and also not represented in the "health" of any specific container. I agree this probably points to a missing feature in PodDisruptionBudget, since it is still fundamentally a question of "is it ok to make $pod unavailable now".

chancez commented 4 years ago

I'm not sure about the best approach, but one of my use cases is jupyterhub notebook pods. These pods can't be interrupted, but we regularly cull inactive/idle ones. I'd like to be able to cordon the node that needs updating and wait for the notebook pods to be stopped (which could be a while) before allowing the node to be rebooted. This might be done using a tool like https://github.com/planetlabs/draino, but the update-operator would need to coordinate with it.
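To make the flow concrete, here's a minimal sketch (hypothetical, not part of the update-operator) of "cordon, then wait for the notebooks to drain out on their own". The `component=singleuser-server` label is how zero-to-jupyterhub labels notebook pods, so treat that as an assumption:

```go
// Cordon the node so no new notebooks land on it, then poll until no notebook
// pods remain there before handing the node back for reboot.
package drain

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func waitForNotebooksToFinish(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Cordon: mark the node unschedulable.
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("cordon %s: %w", nodeName, err)
	}

	// Wait (possibly for a long time) until the culler has removed every
	// notebook pod still running on this node.
	for {
		pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
			LabelSelector: "component=singleuser-server",
			FieldSelector: "spec.nodeName=" + nodeName,
		})
		if err != nil {
			return err
		}
		if len(pods.Items) == 0 {
			return nil // safe to reboot the node now
		}
		time.Sleep(time.Minute)
	}
}
```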

jahkeup commented 4 years ago

Thanks for sharing your use case and laying out what your ideal operation would look like.

> This might be done using a tool like planetlabs/draino, but the update-operator would need to coordinate with it.

Draino looks very closely related to this problem space. The project appears to build on the Kubernetes autoscaler in order to accomplish its task. I'm curious what other projects are integrating with the autoscaler and what they use to enhance the features provided.

We'll likely check out both of these projects as the design is sketched out.