thomastaylor312 closed this issue 6 years ago
When update-agent starts on a node, it unconditionally sets the node schedulable again.
It would be more correct to do this only if reboot-in-progress and reboot-needed were previously true, but that opens up the possibility that the agent crashes between checking the annotations and marking the node schedulable again, potentially leaving the node unschedulable. We'd need to do this atomically, use some additional piece of state, or perhaps have the operator take on schedulability decisions.
It looks like #176 should solve this problem. It adds an annotation to the node if the agent was responsible for marking the node unschedulable, and the agent only marks it as schedulable again if it finds that annotation.
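For readers who want to see the idea concretely, here is a minimal sketch of the approach described above. It is not the code from #176: the `Node` struct is a simplified stand-in for a Kubernetes node object, and the annotation key and both function names are hypothetical, chosen only for illustration.

```go
package main

import "fmt"

// Hypothetical annotation key; the key actually used by #176 may differ.
const agentMadeUnschedulable = "example.com/agent-made-unschedulable"

// Node is a simplified stand-in for a Kubernetes node object.
type Node struct {
	Unschedulable bool
	Annotations   map[string]string
}

// cordonForReboot marks the node unschedulable and records that the agent
// was the one who did it. If the node is already unschedulable (for example
// because an admin cordoned it), no marker is written.
func cordonForReboot(n *Node) {
	if n.Unschedulable {
		return
	}
	if n.Annotations == nil {
		n.Annotations = map[string]string{}
	}
	n.Annotations[agentMadeUnschedulable] = "true"
	n.Unschedulable = true
}

// uncordonOnStart is what the agent would run at startup. It only makes the
// node schedulable again if the agent's own marker is present, and reports
// whether it changed anything.
func uncordonOnStart(n *Node) bool {
	if n.Annotations[agentMadeUnschedulable] != "true" {
		return false
	}
	n.Unschedulable = false
	delete(n.Annotations, agentMadeUnschedulable)
	return true
}

func main() {
	// Node cordoned by someone else: the agent leaves it alone on restart.
	manual := &Node{Unschedulable: true}
	fmt.Println(uncordonOnStart(manual), manual.Unschedulable) // false true

	// Node cordoned by the agent itself: the agent uncordons it on restart.
	byAgent := &Node{}
	cordonForReboot(byAgent)
	fmt.Println(uncordonOnStart(byAgent), byAgent.Unschedulable) // true false
}
```

In a real cluster, the marker and the unschedulable flag would presumably need to be written in the same node update, otherwise the crash window mentioned earlier in this thread just moves to a different place.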
I'll update this bug again when a release is tagged with this fix.
v0.7.0 is released, which includes #176 and should fix this issue. See the release page for details: https://github.com/coreos/container-linux-update-operator/releases/tag/v0.7.0
Thanks for the work on this @sdemos!
In a case where the CLUO update agent pod is killed or restarts on a node that is currently cordoned, it will mark the node as schedulable again. I am guessing it sees the annotations and thinks it just finished a reboot. Example below: