flatcar / flatcar-linux-update-operator

A Kubernetes operator to manage updates of Flatcar Container Linux
Apache License 2.0

Ensure update-agent waits for all volumes to be detached before rebooting #30

Open invidian opened 4 years ago

invidian commented 4 years ago

Original issue: https://github.com/coreos/container-linux-update-operator/issues/191

Perhaps waiting until kubectl get volumeattachments, with the right selector, returns an empty list would be sufficient?
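A minimal sketch of that polling idea, written in Python for brevity (FLUO itself is written in Go). It assumes a caller-supplied list_attachments function that returns the cluster's current VolumeAttachment objects; that function, the VolumeAttachment stand-in class, and wait_for_detach are hypothetical names for illustration, not part of FLUO or of any Kubernetes client library. Filtering is done client-side on the node name, which mirrors spec.nodeName.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class VolumeAttachment:
    """Hypothetical, minimal stand-in for a storage.k8s.io/v1 VolumeAttachment."""
    name: str
    node_name: str  # mirrors spec.nodeName


def wait_for_detach(
    node: str,
    list_attachments: Callable[[], Iterable[VolumeAttachment]],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no VolumeAttachment references `node`.

    Returns True once the node has no attachments left,
    or False if `timeout` seconds pass first.
    """
    deadline = clock() + timeout
    while True:
        # Keep only the attachments that still point at the node being rebooted.
        remaining = [a for a in list_attachments() if a.node_name == node]
        if not remaining:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The agent would call this after draining and before rebooting, and decide separately what to do on a False result (reboot anyway, or keep waiting). The sleep and clock parameters exist only so the loop can be exercised without real waiting.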

Jasper-Ben commented 3 years ago

Scenario: running a Ceph cluster using the Rook operator. During drain, the volumes are detached, but it might take some time for this to propagate to the kernel unmount. I have not looked into the details, but according to @martin31821 this is caused by the Ceph kernel client doing some foo during unmount, so changing this from userspace is not possible. https://github.com/kinvolk/flatcar-linux-update-operator/pull/62 introduces a quick workaround by just adding some wait time after draining the node.
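The shape of such a workaround can be sketched as follows. This is not the code from the linked pull request; drain, reboot, settle_seconds, and drain_then_reboot are hypothetical names, and the 30-second default is an arbitrary placeholder rather than a recommended value.

```python
import time
from typing import Callable


def drain_then_reboot(
    drain: Callable[[], None],
    reboot: Callable[[], None],
    settle_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Drain the node, wait a fixed delay, then reboot.

    The delay gives in-flight unmounts time to finish. It is a
    best-effort mitigation: nothing here verifies that the volumes
    are actually detached when the delay ends.
    """
    drain()
    sleep(settle_seconds)
    reboot()
```

A fixed delay is simple to add, but it has to be tuned per cluster, and a slow unmount can still outlast it.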

martin31821 commented 3 years ago

Maybe we can solve this by introducing a way to run one or more Kubernetes jobs prior to rebooting, which could be used e.g. to change DNS records, wait a certain amount of time, or run host commands.
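One way to picture this proposal is as an ordered list of pre-reboot steps that must all succeed before the reboot is allowed. The sketch below models each step as a plain Python callable returning True on success; in the actual proposal these would be Kubernetes jobs. The names Step and run_pre_reboot_steps are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pre_reboot_steps(steps: Sequence[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (ok, completed): `ok` is True only if every step succeeded,
    and `completed` lists the names of the steps that finished successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed
```

The caller would reboot only when ok is True. Stopping at the first failure keeps the node up if, for example, the DNS change did not go through.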

invidian commented 3 years ago

Not ideal, but I guess we could test against that on Lokomotive, as we have a pipeline there that tests FLUO and Rook together. CC @surajssd

invidian commented 3 years ago

Note: the existing hooks run before the node is drained, which can indeed make it impossible right now to deploy a custom hook that could ensure this. Perhaps this could be addressed.

invidian commented 3 years ago

As part of #37, I'm analyzing how FLUO works in detail, as there is no documentation or tests. What comes to mind is that perhaps the hooks model could be extended so that it's possible to run a workflow between each significant action taken, which would be:

However, the existing state tracking model is overly complex, and right now I don't feel comfortable adding another step to it. Perhaps we should try to simplify it first, then extend it with an extra hook.

Nuckal777 commented 3 years ago

We are affected by this as well. A few seconds of sleep after draining, like in https://github.com/flatcar-linux/flatcar-linux-update-operator/pull/62, would help mitigate it.

invidian commented 2 years ago

Just realized I think I hit this issue on my cluster as well :smile:

invidian commented 2 years ago

Created a PoC/draft PR to play around with this, and things seem to improve nicely: https://github.com/flatcar-linux/flatcar-linux-update-operator/pull/169.