CLUO updates node to be schedulable if pod is deleted

coreos / container-linux-update-operator

A Kubernetes operator to manage updates of Container Linux by CoreOS

Apache License 2.0


CLUO updates node to be schedulable if pod is deleted #138

Closed: thomastaylor312 closed this issue 6 years ago

thomastaylor312 commented 7 years ago

If the CLUO update-agent pod is killed or restarts on a node that is currently cordoned, the agent marks the node as schedulable again when it comes back up. I am guessing it sees the annotations and thinks it has just finished a reboot. Example below:

$ kubectl get nodes a.redacted.node
NAME                        STATUS                     AGE       VERSION
a.redacted.node             Ready,SchedulingDisabled   45d       v1.7.1+coreos.0
$ kubectl delete pod --namespace kube-system container-linux-update-agent-hc2qx
pod "container-linux-update-agent-hc2qx" deleted
$ kubectl get nodes a.redacted.node
NAME              STATUS    AGE       VERSION
a.redacted.node   Ready     45d       v1.7.1+coreos.0

dghubble commented 7 years ago

When update-agent starts on a node, it unconditionally sets the node schedulable again.

It would be more correct to do this only if reboot-in-progress and reboot-needed were previously true, but that opens up the possibility that the program crashes between checking the annotations and marking the node schedulable again, which could leave nodes unschedulable. We'd need to do this atomically, use some additional piece of state, or perhaps have the operator take on schedulability decisions.
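To make the two startup behaviors concrete, here is a minimal Go sketch that contrasts them on an in-memory node. It is an illustration only, not the agent's actual code: the `Node` type, the function names, and the `example.com/...` annotation keys are all hypothetical stand-ins, and there is no Kubernetes client involved.

```go
package main

import "fmt"

// Node is a simplified, in-memory stand-in for a Kubernetes node object.
type Node struct {
	Unschedulable bool
	Annotations   map[string]string
}

// Hypothetical annotation keys, named after the flags mentioned above.
const (
	rebootNeeded     = "example.com/reboot-needed"
	rebootInProgress = "example.com/reboot-in-progress"
)

// startUnconditional models the reported behavior: every agent start
// marks the node schedulable, no matter who cordoned it.
func startUnconditional(n *Node) {
	n.Unschedulable = false
}

// startConditional marks the node schedulable only if both reboot flags
// were set. Against a real API server the check and the update would be
// separate steps, which is the crash window described above.
func startConditional(n *Node) {
	if n.Annotations[rebootNeeded] == "true" && n.Annotations[rebootInProgress] == "true" {
		n.Unschedulable = false
	}
}

func main() {
	// A node cordoned by an administrator, with no reboot flags set.
	a := &Node{Unschedulable: true, Annotations: map[string]string{}}
	startUnconditional(a)
	fmt.Println("unconditional start, still cordoned:", a.Unschedulable)

	b := &Node{Unschedulable: true, Annotations: map[string]string{}}
	startConditional(b)
	fmt.Println("conditional start, still cordoned:", b.Unschedulable)
}
```

The sketch does not reproduce the crash scenario itself; it only shows where the check and the update sit relative to each other.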

sdemos commented 6 years ago

It looks like #176 should solve this problem. It adds an annotation to the node if the agent was responsible for marking the node unschedulable, and the agent only marks it as schedulable again if it finds that annotation.
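As a rough illustration of the approach described for #176, the sketch below records ownership in an annotation when the agent cordons a node and only uncordons when that annotation is present. This is a simplified model under stated assumptions, not the code from the pull request: the `Node` type, the function names, and the `example.com/agent-made-unschedulable` key are hypothetical.

```go
package main

import "fmt"

// Node is a simplified, in-memory stand-in for a Kubernetes node object.
type Node struct {
	Unschedulable bool
	Annotations   map[string]string
}

// Hypothetical marker key; the real annotation name may differ.
const agentMadeUnschedulable = "example.com/agent-made-unschedulable"

// cordonForReboot marks the node unschedulable and records that the agent
// did so. If the node is already cordoned, someone else owns that state,
// so no marker is added.
func cordonForReboot(n *Node) {
	if n.Unschedulable {
		return
	}
	if n.Annotations == nil {
		n.Annotations = map[string]string{}
	}
	n.Unschedulable = true
	n.Annotations[agentMadeUnschedulable] = "true"
}

// uncordonIfOwned marks the node schedulable only when the marker is
// present, then clears the marker. It reports whether it changed the node.
func uncordonIfOwned(n *Node) bool {
	if n.Annotations[agentMadeUnschedulable] != "true" {
		return false
	}
	n.Unschedulable = false
	delete(n.Annotations, agentMadeUnschedulable)
	return true
}

func main() {
	// Cordoned by an administrator: an agent restart leaves it alone.
	admin := &Node{Unschedulable: true}
	fmt.Println("admin node changed:", uncordonIfOwned(admin))

	// Cordoned by the agent: a later agent start may uncordon it.
	owned := &Node{}
	cordonForReboot(owned)
	fmt.Println("agent node changed:", uncordonIfOwned(owned))
}
```

With this shape, deleting the agent pod on a manually cordoned node, as in the original report, would leave the node in SchedulingDisabled.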

I'll update this bug again when a release is tagged with this fix.

sdemos commented 6 years ago

v0.7.0 is released, which includes #176 and should fix this issue. See the release page for details: https://github.com/coreos/container-linux-update-operator/releases/tag/v0.7.0

thomastaylor312 commented 6 years ago

Thanks for the work on this @sdemos!