flatcar / flatcar-linux-update-operator

A Kubernetes operator to manage updates of Flatcar Container Linux
Apache License 2.0

[RFE] Add delay between reboots #167

Open mikekuzak opened 2 years ago

mikekuzak commented 2 years ago

Current situation

Hi,

We have a K8ssandra cluster running on our K8s cluster. Flatcar reboots quite quickly, but some applications might take longer to start up and initialize. The Flatcar operator doesn't know anything about running apps, as it's not designed to do this.

Impact

Applications might lose quorum when a K8s cluster running on Flatcar reboots its nodes too fast.

Ideal future situation

It would be good to have some sort of mechanism that prevents the next reboot from happening too soon, even if Flatcar itself is already back up.

Implementation options

Maybe there is a way to add a simple time-based solution, for example a delay of 10 minutes before the next eligible node reboots.

Thanks

invidian commented 2 years ago

I think a good solution would be to use before/after reboot annotations to simply wait some time before proceeding with the node reboot. Perhaps looking at #37 will give you an idea of the stage at which you want to produce the annotations.
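The annotation-based approach could be sketched roughly as follows. This is a minimal illustration, not the operator's own code: the node is modelled as a plain dict rather than fetched from the Kubernetes API, the label key is assumed to be the one the operator uses to mark a node as waiting for before-reboot checks, and the annotation key `example.com/reboot-delay-elapsed` and the function name are hypothetical (the operator would have to be configured to wait for that annotation).

```python
from datetime import datetime, timedelta

# Assumed label the operator sets on a node waiting for before-reboot checks.
BEFORE_REBOOT_LABEL = "flatcar-linux-update.v1.flatcar-linux.net/before-reboot"
# Hypothetical annotation; the operator must be configured to wait for it.
DELAY_ANNOTATION = "example.com/reboot-delay-elapsed"


def annotations_to_set(node, now, last_reboot_done_at,
                       min_delay=timedelta(minutes=10)):
    """Return the annotations a hook should add to `node`, or {} to keep waiting.

    `last_reboot_done_at` is when the previously rebooted node became ready
    again, or None if no node has been rebooted yet.
    """
    labels = node.get("labels", {})
    if labels.get(BEFORE_REBOOT_LABEL) != "true":
        return {}  # node is not waiting for a before-reboot check
    if last_reboot_done_at is not None and now - last_reboot_done_at < min_delay:
        return {}  # the previous node finished too recently, keep waiting
    return {DELAY_ANNOTATION: "true"}  # delay has elapsed, let the reboot proceed
```

A real hook would run this check in a loop (or on node events) and patch the node with the returned annotations; that part is left out here because it depends on the client library in use.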

I've also created #168 to make it more obvious how to implement some custom rebooting logic, as I don't think existing examples are good enough.

Let me know if you're able to implement it yourself. If not, I'll help you out.


Alternatively, we could expose the ReconciliationPeriod parameter in the operator, which could be increased from the default 30 seconds to, let's say, 10 minutes, so that nodes reboot roughly every 20 minutes (see #75): https://github.com/flatcar-linux/flatcar-linux-update-operator/blob/53f08043e320c853940ed7b4c126c7b72af1af00/pkg/operator/operator.go#L98

However, the operator CLI currently has no tests, so those should be added first. Also, ideally the operator will change its operating model to be event-based (#143), at which point such a delay won't be easy to implement anymore.
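To make the arithmetic in this suggestion explicit, here is a small sketch. It assumes, based only on the "10 minutes, so roughly every 20 minutes" estimate above, that one node reboot takes about two reconciliation passes; that factor and the function names are illustrative and not taken from the operator's source.

```python
from datetime import timedelta


def approx_reboot_interval(reconciliation_period, passes_per_reboot=2):
    """Rough time between consecutive node reboots.

    passes_per_reboot=2 is an assumption inferred from the estimate in the
    comment above, not a value read from the operator's code.
    """
    return reconciliation_period * passes_per_reboot


def approx_rollout_time(node_count, reconciliation_period, passes_per_reboot=2):
    """Rough time to roll an update through `node_count` nodes, one at a time."""
    return approx_reboot_interval(reconciliation_period, passes_per_reboot) * node_count
```

Under these assumptions, a 10-minute period gives about 20 minutes between reboots, so a six-node cluster would take about two hours to update, while the default 30-second period gives about one minute between reboots.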

simonello commented 2 years ago

> Alternatively we could expose ReconciliationPeriod parameter in operator, which could be increased from default 30 seconds to let's say 10 minutes, so nodes reboot roughly every 20 minutes then (See #75)

This would be a great feature to expose. How hard would this be to implement?

invidian commented 2 years ago

> How hard would this be to implement?

Dead simple to implement right now, but hard to maintain, which is why I suggested using hooks instead.

dgsardina commented 4 months ago

Has anyone found a way to set some kind of simple delay between reboots? We hit this problem on every upgrade, and an easy workaround would be to add an extra 10-minute delay.