Closed by PatrickLang 4 years ago
@PatrickLang We fixed an issue related to lock timeout in 1.0.30 (#445). The fix removes the lock file if the process that held the lock has exited. Check the PID recorded in azure-vnet.json.lock under c:\k\ and check whether that process is still running. If it is running, can you capture a process dump of that PID?
Is this an ISSUE or FEATURE REQUEST? (choose one): Issue
Which release version?: 1.0.29
Which component (CNI/IPAM/CNM/CNS): CNI
Which Operating System (Linux/Windows): Windows
For windows: provide output of "$(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion"
Which Orchestrator and version (e.g. Kubernetes, Docker): Kubernetes 1.17
What happened:
Under heavy system load, azure-vnet.exe terminated without logging any result or deleting the lock file. All azure-vnet calls since then have failed.
azure-vnet.json.lock contents are
7740
These are all the logs from process 7740, which terminated without removing the lock.
Other processes following that one also logged the same failure:
Even after the system load was relieved, all azure-vnet calls continue to fail:
What you expected to happen: No lock leaks
How to reproduce it (as minimally and precisely as possible):
Deploy this and scale it to more than 4x the number of CPUs on the node.
Anything else we need to know:
This also causes some difficult-to-understand errors in the kubelet logs.
These events are also passed back to Kubernetes and are visible in kubectl describe pod ...