ansible-collections / kubernetes.core

The collection includes a variety of Ansible content to help automate the management of applications in Kubernetes and OpenShift clusters, as well as the provisioning and maintenance of clusters themselves.

kubernetes.core.k8s_drain: drain can get stuck because pods are evicted in order #711

Open JB26 opened 6 months ago

JB26 commented 6 months ago
SUMMARY

Draining a node can get stuck because pods are evicted in order rather than asynchronously. kubectl drains pods asynchronously.

This is a problem if pods have dependencies on each other. E.g., as in #474, the longhorn instance manager can only be evicted after the pods using a longhorn volume have been evicted (at least as far as I understand it). kubectl retries until the pod can be evicted; this retry behavior seems to be missing in the ansible module.
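
For reference, a drain task roughly like the one below is what gets stuck. This is only a minimal sketch; the node name and the delete options mirror the kubectl flags shown further down, and the option names are my reading of the module's documented delete_options.

- name: Drain the node before maintenance
  kubernetes.core.k8s_drain:
    state: drain
    name: kube2
    delete_options:
      force: true
      ignore_daemonsets: true
      delete_emptydir_data: true
      # Intended as the equivalent of kubectl drain --grace-period=10
      terminate_grace_period: 10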

Note

longhorn fixed the need for --pod-selector='app!=csi-attacher,app!=csi-provisioner' for kubectl drain .... But it can still be used as a workaround in the ansible module.

I still think it makes sense to keep the functionality of this module close to that of kubectl drain ....
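
For anyone who needs that workaround on the module side, a sketch of it is below, assuming the pod_selectors option available in recent collection versions behaves like kubectl's --pod-selector:

- name: Drain the node, leaving the Longhorn CSI pods alone (workaround sketch)
  kubernetes.core.k8s_drain:
    state: drain
    name: kube2
    # Assumed to work like kubectl drain --pod-selector: only matching pods are evicted
    pod_selectors:
      - app!=csi-attacher
      - app!=csi-provisioner
    delete_options:
      force: true
      ignore_daemonsets: true
      delete_emptydir_data: true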

ISSUE TYPE
COMPONENT NAME

kubernetes.core.k8s_drain

ANSIBLE VERSION
ansible [core 2.16.6]
  config file = None
  configured module search path = ['/home/jb/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.12/site-packages/ansible
  ansible collection location = /home/jb/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.12.3 (main, Apr 23 2024, 09:16:07) [GCC 13.2.1 20240417] (/usr/bin/python)
  jinja version = 3.1.3
  libyaml = True
COLLECTION VERSION
Collection      Version
--------------- -------
kubernetes.core 2.4.2 
CONFIGURATION
CONFIG_FILE() = None
EDITOR(env: EDITOR) = vim
OS / ENVIRONMENT

k3s on arch

STEPS TO REPRODUCE

Sorry, I don't have an easy way to reproduce this. You need a pod that cannot be evicted at the moment the drain runs (e.g. because a PodDisruptionBudget blocks it).

EXPECTED RESULTS

As many pods as possible should be evicted.

ACTUAL RESULTS

The eviction process gets stuck on pods that cannot be evicted at that moment.

gravesm commented 6 months ago

What is the error message you get when trying to drain the node?

JB26 commented 6 months ago

With the ansible module I get a Too Many Requests error:

fatal: [kube2.cave.$DOMAIN -> 127.0.0.1]: FAILED! => {"attempts": 12, "changed": false, "msg": "Failed to delete pod longhorn-system/instance-manager-db2cb9a890ef3db2d0af534090eb1fc2 due to: Too Many Requests"}

The output from kubectl looks like this:

~/R/m/cave (main|✚1) [2]$ kubectl drain --force --ignore-daemonsets --delete-emptydir-data --grace-period=10 kube2
node/kube2 already cordoned
Warning: ignoring DaemonSet-managed Pods: default/intel-gpu-plugin-5cp9g, kube-system/kube-vip-ds-nrlsz, longhorn-system/engine-image-ei-5cefaf2b-2kc48, longhorn-system/longhorn-csi-plugin-5qwp2, longhorn-system/longhorn-manager-qbdl7, node-feature-discovery/nfd-worker-l8wxc
evicting pod send2kodi/send2kodi-6b978cf87f-mfd2q
evicting pod jellyfin/jellyfin-0
evicting pod longhorn-system/longhorn-ui-655b65f7f9-fs4qf
evicting pod kube-system/metrics-server-54fd9b65b-96xsg
evicting pod longhorn-system/instance-manager-db2cb9a890ef3db2d0af534090eb1fc2
evicting pod node-feature-discovery/nfd-gc-54f888b58b-p2r5p
evicting pod monitoring/ping-exporter-5bc7857dc-pq8dt
error when evicting pods/"instance-manager-db2cb9a890ef3db2d0af534090eb1fc2" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/nfd-gc-54f888b58b-p2r5p evicted
pod/jellyfin-0 evicted
pod/ping-exporter-5bc7857dc-pq8dt evicted
pod/metrics-server-54fd9b65b-96xsg evicted
evicting pod longhorn-system/instance-manager-db2cb9a890ef3db2d0af534090eb1fc2
error when evicting pods/"instance-manager-db2cb9a890ef3db2d0af534090eb1fc2" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod longhorn-system/instance-manager-db2cb9a890ef3db2d0af534090eb1fc2
error when evicting pods/"instance-manager-db2cb9a890ef3db2d0af534090eb1fc2" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/send2kodi-6b978cf87f-mfd2q evicted
pod/longhorn-ui-655b65f7f9-fs4qf evicted
evicting pod longhorn-system/instance-manager-db2cb9a890ef3db2d0af534090eb1fc2
pod/instance-manager-db2cb9a890ef3db2d0af534090eb1fc2 evicted
node/kube2 drained

My understanding from these messages is that kubectl gets past the "Cannot evict pod as it would violate the pod's disruption budget" error from the longhorn instance-manager by evicting all the other pods in parallel and retrying, which eventually resolves the disruption budget violation. The ansible module cannot resolve the situation automatically.

gravesm commented 6 months ago

It's likely that kubectl eventually succeeds because it just keeps retrying the eviction. The k8s_drain module does not retry; it just makes a single eviction request for each pod on the node. I don't have any experience with longhorn, but https://github.com/longhorn/longhorn/issues/5910 seems to point to new settings that will handle evicting the instance manager. You would need to use retries to get the module to keep trying until longhorn has finished with the eviction. One way or another, the pod disruption budget will have to be met before a node can be drained.
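
The retry approach would look something like the task below. This is only a sketch; the attempt count and delay are arbitrary, and the whole drain task is simply re-run until it stops failing.

- name: Drain the node, retrying while evictions are blocked by a disruption budget
  kubernetes.core.k8s_drain:
    state: drain
    name: kube2
    delete_options:
      force: true
      ignore_daemonsets: true
      delete_emptydir_data: true
  register: drain_result
  # Re-run the whole task until it succeeds; each attempt re-issues evictions
  # for the pods still on the node
  retries: 12
  delay: 10
  until: drain_result is succeeded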

bendres97 commented 1 month ago

I found this thread after running into the same issue, but I can successfully evict the pods with kubectl. Output below for the affected pods:

error when evicting pods/"instance-manager-e-4235d552de1938f9f62d9b2829fa28a6" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pods/"instance-manager-r-4235d552de1938f9f62d9b2829fa28a6" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod longhorn-system/instance-manager-e-4235d552de1938f9f62d9b2829fa28a6
evicting pod longhorn-system/instance-manager-r-4235d552de1938f9f62d9b2829fa28a6
error when evicting pods/"instance-manager-e-4235d552de1938f9f62d9b2829fa28a6" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pods/"instance-manager-r-4235d552de1938f9f62d9b2829fa28a6" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod longhorn-system/instance-manager-e-4235d552de1938f9f62d9b2829fa28a6
evicting pod longhorn-system/instance-manager-r-4235d552de1938f9f62d9b2829fa28a6
error when evicting pods/"instance-manager-e-4235d552de1938f9f62d9b2829fa28a6" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pods/"instance-manager-r-4235d552de1938f9f62d9b2829fa28a6" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod longhorn-system/instance-manager-e-4235d552de1938f9f62d9b2829fa28a6
evicting pod longhorn-system/instance-manager-r-4235d552de1938f9f62d9b2829fa28a6
error when evicting pods/"instance-manager-e-4235d552de1938f9f62d9b2829fa28a6" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pods/"instance-manager-r-4235d552de1938f9f62d9b2829fa28a6" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod longhorn-system/instance-manager-e-4235d552de1938f9f62d9b2829fa28a6
evicting pod longhorn-system/instance-manager-r-4235d552de1938f9f62d9b2829fa28a6

The pods do eventually get evicted after several retries, but this is all handled by kubectl without any further input.

kubernetes.core does not have the same behavior. Even with 100 retries configured, the task still fails, despite kubectl only needing a couple of retries. Using the pod selector to ignore the instance manager is a viable workaround for my particular use case, but I agree that there should be a larger effort to handle this automatically the same way kubectl does. I suspect that, without parallel execution handling the other, potentially dependent pods, kubernetes.core will not be able to handle this with any number of retries unless one excludes the problematic pods. This is less than ideal for two reasons:

  1. Kubernetes Administrators, Developers, and the Ansible developers all need to coordinate any new additions of pods with dependencies that need to be excluded from the selector.
  2. Pods that are ignored are not drained and are terminated less gracefully when the node reboots or is otherwise affected by whatever prompted the drain in the first place.

I suppose another alternative would be to just run kubectl drain <node> <args> as a shell command, but that is not really a workaround, as it simply forgoes kubernetes.core entirely, which should be outside the scope of this issue.
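
For completeness, that fallback would be something like the task below (sketch only; as said, it sidesteps kubernetes.core rather than fixing it). The kubectl arguments are the ones from the output earlier in this thread.

- name: Drain the node with kubectl instead of the module (fallback sketch)
  ansible.builtin.command:
    cmd: kubectl drain kube2 --force --ignore-daemonsets --delete-emptydir-data --grace-period=10
  # kubectl does its own retrying, so no retries/until loop is needed here
  changed_when: true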