Open idmurphy opened 1 year ago
I resolved this by removing line 48 (i.e. the kubectl call before the reboot) and adding the mentioned podPriorityClass to the yaml file, and we have been using this modified version. We haven't seen the issue since.
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
Describe the bug: The revert-cgroups DaemonSet (the container that reverts nodes to cgroups v1) intermittently does not reboot the node.
We have installed the DaemonSet released by the AKS team for reverting cgroups to v1, from here: https://github.com/Azure/AKS/blob/master/examples/cgroups/revert-cgroup-v1.yaml
However, we have seen a few occasions where one or more nodes in the AKS cluster did not get rebooted, even though the cgroup-version label was added to the node. This is highly unpredictable given that nodes can get scaled up and down, and it can therefore leave application pods in a non-working state.
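For reference, a labelled-but-not-rebooted node can be spotted with something like the following; the label name is the one the script applies, and the commands are standard kubectl/coreutils shown purely as an illustration.

```sh
# Which nodes already carry the label the script sets
kubectl get nodes -L cgroup-version

# Then, from a shell on a labelled node (SSH or kubectl debug), check which
# cgroup hierarchy is actually mounted:
stat -fc %T /sys/fs/cgroup/
# "cgroup2fs" -> still on cgroups v2, i.e. the reboot never happened
# "tmpfs"     -> cgroups v1 is active as expected
```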
To Reproduce: The issue is intermittent; see the additional context below.
Expected behavior: The cgroup-version label should only be applied to the node once it is known that cgroups v1 is active, which means the reboot must always have occurred first.
Screenshots:
- last reboot at 06:38, node still on cgroups v2
- grub file updated at 06:40 by the revert-cgroups script
Environment (please complete the following information):
- AKS 1.26
Additional context: When we connected to the node to check the grub file, we found it had been set for cgroups v1 as per the revert-cgroups script above. However, when we compare the node's last reboot time with the time the grub file was updated, the last reboot happened before the grub file got updated, so we can conclude the 'reboot' line in the revert-cgroups script never ran.
What we believe is happening is that, since the labelling of the node using kubectl occurs before the reboot line (i.e. on line 48 of the above yaml), the pod can get de-scheduled from the node before the reboot line runs, hence there is a race condition.
We have tested removing line 48 and this works as expected. It also means the node does not get labelled until after the reboot: on the next run the script re-checks, finds cgroups v1 active, and therefore enters the else branch and labels the node per line 51, as sketched below.
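To make the intended flow explicit, here is a rough sketch of the script with the label moved after the reboot. The grub edit and the NODE_NAME variable are placeholders for illustration; they are not the exact contents of revert-cgroup-v1.yaml.

```sh
# Sketch only: illustrates "label after reboot", not the exact upstream script.
if [ "$(stat -fc %T /sys/fs/cgroup/)" = "cgroup2fs" ]; then
  # ...edit grub to force cgroups v1, as the upstream script does...
  # note: no "kubectl label" here any more (old line 48 removed)
  reboot
else
  # cgroups v1 is already active, i.e. this is the run after the reboot,
  # so it is now safe to label the node (old line 51)
  kubectl label node "$NODE_NAME" cgroup-version=v1 --overwrite
fi
```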
In addition, the revert-cgroups pod doesn't have a pod priority class assigned, which means there is also a risk that Kubernetes schedules other pods onto the node before this one. Therefore, we would request that a pod priority class is added, along the lines of the example below.
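Something like the following is what we have in mind; the class name and value are only illustrative and are not part of the upstream yaml.

```yaml
# Example only: the class name and value are illustrative.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: revert-cgroups-priority
value: 1000000000          # highest value allowed for user-defined classes
globalDefault: false
description: "Schedule the cgroup revert DaemonSet ahead of application pods."
```

The DaemonSet's pod template would then reference it via priorityClassName: revert-cgroups-priority, or it could simply reuse the built-in system-node-critical class, which exists for node-level housekeeping pods like this.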
Please confirm if