no daemon pod found (node exists) - daemon also exists and running

sunya-ch commented 1 year ago

[ ] checked https://foundation-model-stack.github.io/multi-nic-cni/user_guide/troubleshooting/.
[ ] titled with the bug issue (if applicable).
[ ] provided corresponding information regarding the troubleshooting guidelines (CR list/detail, multi-nic cni controller and/or daemon status/log).

Describe the bug A clear and concise description of what the bug is.

The daemon pod is running, but after the controller has been restarted it cannot cache the daemon pod. This happens in large-scale clusters where the API server is busy most of the time.

> cannot UpdateCurrentList: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)

The multi-nic CNI cannot list the running daemon pods at startup.

no daemon pod found (node exists)

To Reproduce Steps to reproduce the behavior:

Expected behavior A clear and concise description of what you expected to happen.

The current list of daemons needs to be updated periodically whenever a daemon is found missing.
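A minimal sketch of that behavior, not the actual multi-nic-cni code: a cache that retries the listing whenever a lookup misses, so a failed listing at controller start does not leave the cache empty for good. All names here (`daemonCache`, `listFunc`, `flakyLister`, the node and pod names) are hypothetical, and the busy API server is simulated by a lister that fails a fixed number of times. Each call to `get` stands in for one periodic reconcile pass.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// listFunc is a hypothetical stand-in for the API call that lists daemon pods.
// It returns a map of node name -> daemon pod name.
type listFunc func() (map[string]string, error)

// daemonCache is a hypothetical in-memory cache of daemon pods keyed by node name.
type daemonCache struct {
	mu   sync.Mutex
	pods map[string]string
	list listFunc
}

func newDaemonCache(list listFunc) *daemonCache {
	return &daemonCache{pods: map[string]string{}, list: list}
}

// refresh replaces the cached list. On error the previous contents are kept.
func (c *daemonCache) refresh() error {
	pods, err := c.list()
	if err != nil {
		return fmt.Errorf("cannot update current list: %w", err)
	}
	c.mu.Lock()
	c.pods = pods
	c.mu.Unlock()
	return nil
}

// get returns the daemon pod for a node. On a cache miss it refreshes the
// list once before reporting that no daemon pod was found.
func (c *daemonCache) get(node string) (string, error) {
	c.mu.Lock()
	pod, ok := c.pods[node]
	c.mu.Unlock()
	if ok {
		return pod, nil
	}
	if err := c.refresh(); err != nil {
		return "", err
	}
	c.mu.Lock()
	pod, ok = c.pods[node]
	c.mu.Unlock()
	if !ok {
		return "", errors.New("no daemon pod found")
	}
	return pod, nil
}

// flakyLister simulates a busy API server: the first `failures` calls time
// out, later calls return one running daemon pod on node-a.
func flakyLister(failures int) listFunc {
	calls := 0
	return func() (map[string]string, error) {
		calls++
		if calls <= failures {
			return nil, errors.New("simulated timeout: server was unable to return a response")
		}
		return map[string]string{"node-a": "multi-nicd-a"}, nil
	}
}

func main() {
	cache := newDaemonCache(flakyLister(2))

	// The initial listing at controller start fails.
	fmt.Println("initial refresh:", cache.refresh())

	// Later passes retry on a miss until the listing succeeds.
	for i := 1; i <= 3; i++ {
		pod, err := cache.get("node-a")
		fmt.Printf("pass %d: pod=%q err=%v\n", i, pod, err)
	}
}
```

Refreshing only on a miss keeps extra load off an already busy API server; a real controller would also need to bound how often the retry can fire.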

Screenshots If applicable, add screenshots to help explain your problem.

manager container of controller and multi-nicd DS status:
multinicnetwork CR:
hostinterface list/CR:
cidr CR (multiNICIPAM: true):
ippools CR (multiNICIPAM: true):
log of manager container:
log of failed multi-nicd pod:

Environment (please complete the following information):

platform: [e.g. self-managed k8s, self-managed OpenShift, EKS, IKS, AKS]
node profile:
operator version:
cluster scale (number of nodes, pods, interfaces):

Additional context Add any other context about the problem here.

sunya-ch commented 1 year ago

WIP: https://github.com/sunya-ch/multi-nic-cni/commit/0047f12a7500fed42e2d61acc79261ded44d67e4

sunya-ch commented 1 year ago

should be fixed by https://github.com/foundation-model-stack/multi-nic-cni/pull/119

no daemon pod found (node exists) - daemon also exists and running #118