Azure / azure-container-networking

Azure Container Networking Solutions for Linux and Windows Containers
MIT License
375 stars 236 forks source link

azure-vnet can leave lock behind if called on a loaded system #464

Closed PatrickLang closed 4 years ago

PatrickLang commented 4 years ago

Is this an ISSUE or FEATURE REQUEST? (choose one): Issue

Which release version?: 1.0.29

Which component (CNI/IPAM/CNM/CNS): CNI

Which Operating System (Linux/Windows): Windows

For windows: provide output of "$(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion"


Which Orchestrator and version (e.g. Kubernetes, Docker): Kubernetes 1.17

What happened:

Under a heavily loaded system - azure-vnet.exe terminated without logging any result or deleting the lock file. All azure-vnet calls since then have failed.

azure-vnet.json.lock contents are 7740

These are all the logs from the process 7740 that terminated without removing the lock.

2019/12/19 20:02:10 [7740] Connected to telemetry service
2019/12/19 20:02:10 [7740] [cni-net] Plugin azure-vnet version v1.0.29.
2019/12/19 20:02:10 [7740] [cni-net] Running on windows
2019/12/19 20:02:16 [7740] [net] Network interface: {Index:6 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:c2:a0:b5 Flags:up|broadcast|multicast} with IP: [fe80::8c4c:911b:131f:132a/64 10.240.0.128/12]
2019/12/19 20:02:17 [7740] [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP: [::1/128 127.0.0.1/8]
2019/12/19 20:02:19 [7740] [net] Network interface: {Index:12 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:76:8d:b5 Flags:up|broadcast|multicast} with IP: [fe80::dc51:edd1:aefe:f886/64 172.25.16.1/20]

Other processes following that one also logged the same failure:

2019/12/19 15:31:26 [21564] Formatted Boot time: 2019-12-12 22:58:19
2019/12/19 15:31:26 [21564] [Azure-Utils] ver
2019/12/19 15:31:26 [21564] Connected to telemetry service
2019/12/19 15:31:26 [21564] [cni-net] Plugin azure-vnet version v1.0.29.
2019/12/19 15:31:26 [21564] [cni-net] Running on windows
2019/12/19 15:31:26 [21564] [net] Network interface: {Index:6 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:c2:a0:b5 Flags:up|broadcast|multicast} with IP: [fe80::8c4c:911b:131f:132a/64 10.240.0.128/12]
2019/12/19 15:31:26 [21564] [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP: [::1/128 127.0.0.1/8]
2019/12/19 15:31:26 [21564] [net] Network interface: {Index:12 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:76:8d:b5 Flags:up|broadcast|multicast} with IP: [fe80::dc51:edd1:aefe:f886/64 172.25.16.1/20]
2019/12/19 15:31:26 [21564] [net] network store key not found
2019/12/19 15:31:26 [21564] [cni-net] Plugin started.
2019/12/19 15:31:26 [21564] [cni-net] Plugin stopped.
2019/12/19 15:31:26 [21564] Sending report succeeded

Once the system load was relieved, from this point on, all azure-vnet calls are failing:

2019/12/19 20:13:43 [16676] Formatted Boot time: 2019-12-12 22:58:19
2019/12/19 20:13:43 [16676] [Azure-Utils] ver
2019/12/19 20:14:11 [16676] Formatted Boot time: 2019-12-12 22:58:19
2019/12/19 20:14:11 [16676] [cni] reboot time 2019-12-12 22:58:19 +0000 UTC storeLockFile mod time 2019-12-19 20:02:05.2168854 +0000 UTC
2019/12/19 20:17:04 [16676] [cni] Failed to lock store: timed out locking store.
2019/12/19 20:17:04 [16676] Failed to initialize key-value store of network plugin, err:timed out locking store.
2019/12/19 20:17:04 [16676] Report plugin error

What you expected to happen: No lock leaks


How to reproduce it (as minimally and precisely as possible):

Deploy this and scale to > 4x the number of CPUs the node has

apiVersion: apps/v1
kind: Deployment
metadata:
  name: burn500m
  labels:
    app: burn500m
spec:
  replicas: 3
  template:
    metadata:
      name: burn500m
      labels:
        app: burn500m
    spec:
      containers:
      - name: windowswebserver
        resources:
          limits:
            cpu: .5
            memory: 1Gi
          requests:
            cpu: .25
            memory: 1Gi
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        command:
        - powershell.exe
        - -command
        - "foreach ($loopnumber in 1..8) { Start-Job -ScriptBlock { $result = 1; foreach($mm in 1..2147483647){$res1=1;foreach($num in 1..2147483647){$res1=$mm*$num*1340371};$res1} } } ; get-job | wait-job"
      nodeSelector:
        "beta.kubernetes.io/os": windows
  selector:
    matchLabels:
      app: burn500m

Anything else we need to know:

This also causes some difficult to understand errors in kubelet logs

W1219 23:13:16.676189   14772 cni.go:237] Unable to update cni config: no valid networks found in c:\k\azurecni\netconf
E1219 23:13:29.295807   14772 cni.go:364] Error adding default_burn2-7b86d4db4f-m2zbc/2a108f19f55a5fd4e2572182c3cdb90280309c4626a35fe31c4a1c1173897c49 to network azure-vnet/azure: unexpected end of JSON input
W1219 23:13:43.276619   14772 cni.go:202] Error validating CNI config list {
    "cniVersion":  "0.3.0",
    "name":  "azure",
    "plugins":  [
                    {
                        "type":  "azure-vnet",
                        "mode":  "bridge",
                        "bridge":  "azure0",
                        "capabilities":  {
                                             "portMappings":  true,
                                             "dns":  true
                                         },
                        "ipam":  {
                                     "type":  "azure-vnet-ipam"
                                 },
                        "dns":  {
                                    "Nameservers":  [
                                                        "10.0.0.10",
                                                        "168.63.129.16"
                                                    ],
                                    "Search":  [
                                                   "svc.cluster.local"
                                               ]
                                },
                        "AdditionalArgs":  [
                                               {
                                                   "Name":  "EndpointPolicy",
                                                   "Value":  {
                                                                 "Type":  "OutBoundNAT",
                                                                 "ExceptionList":  [
                                                                                       "10.0.0.0/8",
                                                                                       "10.240.0.0/12"
                                                                                   ]
                                                             }
                                               },
                                               {
                                                   "Name":  "EndpointPolicy",
                                                   "Value":  {
                                                                 "Type":  "ROUTE",
                                                                 "DestinationPrefix":  "10.0.0.0/16",
                                                                 "NeedEncap":  true
                                                             }
                                               }
                                           ]
                    }
                ]
}
: [decoding version info: unexpected end of JSON input]
W1219 23:13:43.276619   14772 cni.go:237] Unable to update cni config: no valid networks found in c:\k\azurecni\netconf
E1219 23:13:51.423887   14772 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "2a108f19f55a5fd4e2572182c3cdb90280309c4626a35fe31c4a1c1173897c49" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input

And these events are passed back to Kubernetes, visible in kubectl describe pod ...

Events:
  Type     Reason                  Age                    From                 Message
  ----     ------                  ----                   ----                 -------
  Warning  FailedScheduling        <unknown>              default-scheduler    0/6 nodes are available: 6 Insufficient cpu.
  Warning  FailedScheduling        <unknown>              default-scheduler    0/6 nodes are available: 6 Insufficient cpu.
  Normal   Scheduled               <unknown>              default-scheduler    Successfully assigned default/burn2-7b86d4db4f-m2zbc to 2758k8s010
  Warning  FailedCreatePodSandBox  9m1s                   kubelet, 2758k8s010  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8931a83501dcb167bf10e9365d2a7f8fe4bd125ddc04f8370b08bd7c3a799e50" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
  Warning  FailedCreatePodSandBox  6m2s                   kubelet, 2758k8s010  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "2044513a9f09b702c426085091d420d129292e62979d10b77b053f3002e27ebc" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
  Warning  FailedSync              5m18s (x4 over 5m41s)  kubelet, 2758k8s010  error determining status: rpc error: code = Unknown desc = Error: No such container: 8931a83501dcb167bf10e9365d2a7f8fe4bd125ddc04f8370b08bd7c3a799e50
  Warning  FailedCreatePodSandBox  4m12s                  kubelet, 2758k8s010  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b32c54cb368f0c37f769a795ceb1c8a10dfb58db54dc3152c2f71cf9caeaff54" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2m43s                  kubelet, 2758k8s010  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "af3d169eb3390c9692d78fef343dbb9959fcb523d981000af0127f79a29c5467" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
  Warning  FailedCreatePodSandBox  74s                    kubelet, 2758k8s010  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4cd552d3e6df57530227ea9a4a633eea05c62d87cf53ba765164545124e79391" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
  Normal   SandboxChanged          52s (x5 over 7m28s)    kubelet, 2758k8s010  Pod sandbox changed, it will be killed and re-created.

tamilmani1989 commented 4 years ago

@PatrickLang We fixed an issue related to lock timeout in 1.0.30(#445). This will take care of removing lockfile if the process holding that lock was exited. Check the pid printed in azure-vnet.json.lock under c:\k\ and check if the process is still running. If its running can you get processdump of that pid ?