foundation-model-stack / multi-nic-cni

https://foundation-model-stack.github.io/multi-nic-cni/
Apache License 2.0

crashLoopBackOff: controller never recovered or cannot restart when node was tainted during operation. #32

Closed sunya-ch closed 1 year ago

sunya-ch commented 1 year ago

Describe the bug

The manager container of the multi-nic-cni-operator-controller-manager controller pod gets stuck in the CrashLoopBackOff state and never recovers.

Even after we untainted the node, the controller did not recover.

To Reproduce

  1. Run the operator
  2. Taint the node
    kubectl taint node <node name> temporary=true:NoSchedule
  3. Restart the controller pod (or wait for sync period)

Expected behavior

The controller should recover.

Screenshots

multi-nic-cni-operator-controller-manager-54769c6bcb-ttjdz        1/2     CrashLoopBackOff   5 (27s ago)   6m29s

Additional context

When the node was tainted, the multi-nicd daemonset pod on that node was not deleted and kept running, but its connection to the controller was broken.

Workaround

The controller recovers if we take the following steps:

  1. Keep the node tainted
  2. Manually delete the daemon pod on the tainted node.
  3. Restart the controller pod.

Note that after untainting the node and restarting, the controller went back to the hung state. However, after deleting the deployment, it returned to normal operation.

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>

sunya-ch commented 1 year ago

The hang turns out to be caused by the podQueue of DaemonWatcher reaching its limit (MaxQSize = 100) during the initial syncing phase, before dequeuing starts.

Should be fixed by https://github.com/foundation-model-stack/multi-nic-cni/pull/36

This PR removes the need to enqueue items before dequeuing starts.