foundation-model-stack / multi-nic-cni

https://foundation-model-stack.github.io/multi-nic-cni/
Apache License 2.0

Potential hang at scale when listing pods/ippools without a selector #97

Closed sunya-ch closed 1 year ago

sunya-ch commented 1 year ago

Describe the bug

Because the number of pods and ippools can be very large at scale, the CNI components (controller and daemon) can hang for a long time when calling the List API without a selector.

For example,

https://github.com/foundation-model-stack/multi-nic-cni/blob/0879ec42963726ab10214f532ca5c30c787a30f4/controllers/cidr_handler.go#L703

https://github.com/foundation-model-stack/multi-nic-cni/blob/0879ec42963726ab10214f532ca5c30c787a30f4/daemon/src/backend/ippool.go#L76-L87

sunya-ch commented 1 year ago

The idea is to label each IPPool resource with its hostname and network name. The daemon can then list IPPools with a label selector instead of listing all of them. The steps to live-migrate from a previous version are:

  1. Restart the controller and wait for the config to be ready. This should add the labels to the existing ippools.
  2. Add the NODENAME env to the daemonset (make sure that imagePullPolicy is Always), then wait for all daemon pods to restart.
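
For illustration, here is a minimal, standard-library-only sketch of the idea above. It is not the project's actual code: the `IPPool` struct is a simplified stand-in for the real resource, the label keys `hostname` and `netname` are hypothetical, and `filterPools` only mimics in memory what the API server does when a label selector is passed to a List call. The only parts taken from this thread are the NODENAME env and the hostname/network-name labels.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// IPPool is a simplified stand-in for the real IPPool custom resource.
type IPPool struct {
	Name   string
	Labels map[string]string
}

// Hypothetical label keys; the keys used by the project may differ.
const (
	hostLabel    = "hostname"
	networkLabel = "netname"
)

// selectorString renders an equality-based label selector such as
// "hostname=node-a,netname=net1", the string form a Kubernetes List
// call accepts as its label selector.
func selectorString(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

// filterPools mimics server-side filtering: it keeps only the pools
// whose labels match every wanted key/value pair.
func filterPools(pools []IPPool, want map[string]string) []IPPool {
	var out []IPPool
	for _, p := range pools {
		match := true
		for k, v := range want {
			if p.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// The daemon learns its node name from the NODENAME env (step 2).
	node := os.Getenv("NODENAME")
	if node == "" {
		node = "node-a" // fallback for running this sketch locally
	}

	pools := []IPPool{
		{Name: "net1-node-a", Labels: map[string]string{hostLabel: "node-a", networkLabel: "net1"}},
		{Name: "net1-node-b", Labels: map[string]string{hostLabel: "node-b", networkLabel: "net1"}},
		{Name: "net2-node-a", Labels: map[string]string{hostLabel: "node-a", networkLabel: "net2"}},
	}

	want := map[string]string{hostLabel: node, networkLabel: "net1"}
	fmt.Println("selector:", selectorString(want))
	for _, p := range filterPools(pools, want) {
		fmt.Println("matched:", p.Name)
	}
}
```

With a selector like this, each daemon only receives the pools for its own node, so the size of the response no longer grows with the total number of ippools in the cluster.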

sunya-ch commented 1 year ago

This should be fixed by ed3847b5eec787b16be70624e982ec78dd42cea8 (pod listing at the initial state), 6a63faa68989183ea6639cf46775d2e25021ce47 (ippool listing by the daemon), and 56143875dbe95c0c07cad7b3895bcd073ee663bc (avoids a hard error when the daemon fails).