Open hbhasker opened 1 week ago
The CNI today only reconciles its datastore with existing pods at startup but never again.
This is done periodically. That's how the datastore keeps track of the ips used, available and does a ENI call if the additional ENI is needed.
The CNI today only reconciles its datastore with existing pods at startup but never again.
This is done periodically. That's how the datastore keeps track of the ips used, available and does a ENI call if the additional ENI is needed.
I guess I meant reconcile the allocated IPs against running pods. It doesn't check in steady state that a pod is still around when its IP is allocated. The assumption is that if the pod was deleted the IPAMD would have been called by the CNI plugin. But I think there are some cases where this assumption breaks down, like if the CNI->ipamd grpc call fails for any reason. There are probably other edge cases but I couldn't fully reason out what could cause this to happen.
ping for a review!
What type of PR is this?
improvement
Which issue does this PR fix?:
Fixes #3109
What does this PR do / Why do we need it?: The CNI today only reconciles its datastore with existing pods at startup but never again. Sometimes its possible that IPAMD goes out of sync with the kubelet's view of the pods running on the node if it fails or is temporarily unreachable by the CNI plugin handling the DelNetwork call from the contrainer runtime.
In such cases the CNI continues to consider the pods IP allocated and will not free it as it will never see a DelNetwork again. This results in CNI failing to assign IP's to new pods.
This change adds a reconcile loop which periodically (once a minute) reconciles its allocated IPs with existence of pod's veth devices. If the veth device is not found then it free's up the corresponding allocation making the IP available for reuse.
Fixes #3109
Testing done on this change:
Added unit tests.
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades? Has updating a running cluster been tested?: No
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.