Azure / azure-container-networking

Azure Container Networking Solutions for Linux and Windows Containers
MIT License

Azure CNI: timed out locking store #2817

Closed behzad-mir closed 1 month ago

behzad-mir commented 2 months ago

When a large number of pods (>150) are created in parallel, Azure CNI fails with this error:

Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"c99e647e8519b46b4c492d3076ea2da13e861a39ccb982aca89eb5ec5cafae55\": plugin type=\"azure-vnet\" failed (add): Failed to initialize key-value store of network plugin: error Acquiring store lock: timed out locking store" pod="default/k8-parallel-100-joblh8pf-gvn6s"

behzad-mir commented 2 months ago

The issue is caused by Azure CNI's serialized approach to pod creation. Each CNI process acquires a lock at the start of its run and releases it at the end, so when a large number of CNI ADD calls run in parallel, some of them time out while waiting for the lock. The issue is seen more often on Windows.
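The contention described above can be reproduced in miniature. The following is a minimal Go sketch, not the actual Azure CNI code: `storeLock`, `simulateAdds`, and the hold and timeout values are all hypothetical stand-ins chosen for illustration. Each simulated ADD call tries to take a single shared lock with a deadline, holds it for a fixed time, and releases it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errLockTimeout mirrors the wording of the error in the report.
var errLockTimeout = errors.New("timed out locking store")

// storeLock is a hypothetical stand-in for the CNI store lock:
// a one-slot channel that at most one holder can occupy.
type storeLock struct{ ch chan struct{} }

func newStoreLock() *storeLock { return &storeLock{ch: make(chan struct{}, 1)} }

// acquire waits for the lock until the timeout expires.
func (l *storeLock) acquire(timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case l.ch <- struct{}{}:
		return nil
	case <-timer.C:
		return errLockTimeout
	}
}

func (l *storeLock) release() { <-l.ch }

// simulateAdds starts `workers` parallel ADD calls. Each call holds the
// lock for `hold` and gives up after waiting `timeout` for it.
func simulateAdds(workers int, hold, timeout time.Duration) (succeeded, failed int) {
	lock := newStoreLock()
	var wg sync.WaitGroup
	var mu sync.Mutex // guards the two counters
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := lock.acquire(timeout)
			if err == nil {
				time.Sleep(hold) // simulated work done under the lock
				lock.release()
			}
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				failed++
			} else {
				succeeded++
			}
		}()
	}
	wg.Wait()
	return succeeded, failed
}

func main() {
	// Assumed numbers: 150 calls, 10 ms of work each, 200 ms lock timeout.
	ok, failed := simulateAdds(150, 10*time.Millisecond, 200*time.Millisecond)
	fmt.Printf("succeeded=%d failed=%d\n", ok, failed)
}
```

Because the work is serialized, only about timeout/hold calls can finish before the remaining waiters hit their deadline. With these assumed numbers, most of the 150 calls fail, which matches the shape of the failure reported here.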

To address the issue, a new CNI version called Stateless CNI has been designed and implemented. It enables parallel pod creation and removes the process locks. https://github.com/Azure/azure-container-networking/pull/2276

The first target is the Windows AKS Swift scenario, and rollout has started for K8s 1.30.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] commented 1 month ago

Issue closed due to inactivity.