Azure / azure-container-networking

Azure Container Networking Solutions for Linux and Windows Containers
MIT License
377 stars 241 forks source link

[cni] Timed out on locking store, err:Store is locked. #242

Closed msuiche closed 6 years ago

msuiche commented 6 years ago

Is this a request for help?:

Yes.

Is this an ISSUE or FEATURE REQUEST? (choose one):

ISSUE.

Which release version?:

Plugin azure-vnet version v1.0.10.

Which component (CNI/IPAM/CNM/CNS):

CNI

Which Operating System (Linux/Windows):

Windows Server 1803

PS C:\k> $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion
10.0.17134.1 (WinBuild.160101.0800)

Which Orchestrator and version (e.g. Kubernetes, Docker)

S C:\k> .\kubectl.exe version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"windows/amd64"}

What happened: Windows containers can't start.

2018/08/31 04:37:22 [cni] Timed out on locking store, err:Store is locked.
2018/08/31 04:37:22 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/08/31 04:37:22 Report plugin error
2018/08/31 04:37:22 [cni] Timed out on locking store, err:Store is locked.
2018/08/31 04:37:22 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/08/31 04:37:22 Report plugin error
2018/08/31 04:37:22 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
2018/08/31 04:37:22 "Start Flag false CniSucceeded false Name CNI Version v1.0.10 ErrorMessage Store is locked vnet [] 
                Context AzureCNI SubContext "
2018/08/31 04:37:22 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/31 04:37:22 OSDetails &{windows    }
2018/08/31 04:37:22 SystemDetails &{0 0 0 0 0 0 }
2018/08/31 04:37:22 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.127 00:0d:3a:2f:bd:ac vEthernet (Ethernet 2) 30 0 }
2018/08/31 04:37:22 BridgeDetails <nil>
2018/08/31 04:37:22 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
2018/08/31 04:37:22 "Start Flag false CniSucceeded false Name CNI Version v1.0.10 ErrorMessage Store is locked vnet [] 
                Context AzureCNI SubContext "
2018/08/31 04:37:22 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/31 04:37:22 OSDetails &{windows    }
2018/08/31 04:37:22 SystemDetails &{0 0 0 0 0 0 }
2018/08/31 04:37:22 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.127 00:0d:3a:2f:bd:ac vEthernet (Ethernet 2) 30 0 }
2018/08/31 04:37:22 BridgeDetails <nil>
2018/08/31 04:37:22 Send telemetry success 200
2018/08/31 04:37:22 SetReportState succeeded
2018/08/31 04:37:22 Send telemetry success 200
2018/08/31 04:37:22 SetReportState succeeded
E0831 04:37:22.384629    3540 cni.go:280] Error deleting network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:22.384629    3540 remote_runtime.go:115] StopPodSandbox "982a32ac758ba632a7f5feb185f9c27ad2a83899c2678e44f1f27fe7edf98470" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-kdr8r_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:22.385560    3540 kuberuntime_gc.go:153] Failed to stop sandbox "982a32ac758ba632a7f5feb185f9c27ad2a83899c2678e44f1f27fe7edf98470" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-kdr8r_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:22.386548    3540 cni.go:280] Error deleting network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:22.388545    3540 remote_runtime.go:115] StopPodSandbox "c6ae99e68ba79e053cd0e9dbb9790a18ba2c87ba77e897e0133aee2d81b260ff" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-56957d9b4f-n74ng_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:22.388545    3540 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "c6ae99e68ba79e053cd0e9dbb9790a18ba2c87ba77e897e0133aee2d81b260ff"}
E0831 04:37:22.388545    3540 kuberuntime_manager.go:594] killPodWithSyncResult failed: failed to "KillPodSandbox" for "10373f27-ac8d-11e8-957b-000d3a2fb36b" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"omega-56957d9b4f-n74ng_default\" network: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
E0831 04:37:22.388545    3540 pod_workers.go:186] Error syncing pod 10373f27-ac8d-11e8-957b-000d3a2fb36b ("omega-56957d9b4f-n74ng_default(10373f27-ac8d-11e8-957b-000d3a2fb36b)"), skipping: failed to "KillPodSandbox" for "10373f27-ac8d-11e8-957b-000d3a2fb36b" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"omega-56957d9b4f-n74ng_default\" network: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
I0831 04:37:22.980153    3540 kuberuntime_manager.go:403] No ready sandbox for pod "omega-56957d9b4f-vvc4b_default(103ab3cc-ac8d-11e8-957b-000d3a2fb36b)" can be found. Need to start a new one
E0831 04:37:23.523553    3540 helpers.go:735] eviction manager: failed to construct signal: "allocatableMemory.available" error: system container "pods" not found in metrics
W0831 04:37:23.523553    3540 helpers.go:808] eviction manager: no observation found for eviction signal nodefs.inodesFree
2018/08/31 04:37:24 [cni] Timed out on locking store, err:Store is locked.
2018/08/31 04:37:24 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/08/31 04:37:24 Report plugin error
2018/08/31 04:37:24 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
2018/08/31 04:37:24 "Start Flag false CniSucceeded false Name CNI Version v1.0.10 ErrorMessage Store is locked vnet [] 
                Context AzureCNI SubContext "
2018/08/31 04:37:24 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/31 04:37:24 OSDetails &{windows    }
2018/08/31 04:37:24 SystemDetails &{0 0 0 0 0 0 }
2018/08/31 04:37:24 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.127 00:0d:3a:2f:bd:ac vEthernet (Ethernet 2) 30 0 }
2018/08/31 04:37:24 BridgeDetails <nil>
2018/08/31 04:37:24 Send telemetry success 200
2018/08/31 04:37:24 SetReportState succeeded
E0831 04:37:24.725070    3540 cni.go:280] Error deleting network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:24.727071    3540 remote_runtime.go:115] StopPodSandbox "fdd2080644a1a376a1bcd7c24bd10b2c276dc738fd1a7315cf99c89ac1ea4a44" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-kdr8r_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:24.727071    3540 kuberuntime_gc.go:153] Failed to stop sandbox "fdd2080644a1a376a1bcd7c24bd10b2c276dc738fd1a7315cf99c89ac1ea4a44" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-kdr8r_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
2018/08/31 04:37:25 [cni] Timed out on locking store, err:Store is locked.
2018/08/31 04:37:25 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/08/31 04:37:25 Report plugin error
2018/08/31 04:37:25 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
2018/08/31 04:37:25 "Start Flag false CniSucceeded false Name CNI Version v1.0.10 ErrorMessage Store is locked vnet [] 
                Context AzureCNI SubContext "
2018/08/31 04:37:25 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/31 04:37:25 OSDetails &{windows    }
2018/08/31 04:37:25 SystemDetails &{0 0 0 0 0 0 }
2018/08/31 04:37:25 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.127 00:0d:3a:2f:bd:ac vEthernet (Ethernet 2) 30 0 }
2018/08/31 04:37:25 BridgeDetails <nil>
2018/08/31 04:37:25 Send telemetry success 200
2018/08/31 04:37:25 SetReportState succeeded
E0831 04:37:25.292113    3540 cni.go:280] Error deleting network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:25.292113    3540 remote_runtime.go:115] StopPodSandbox "54d875a7d4c241270bb044e618d70d7563d89a1782c5fc4736aa2dcaf8d20c35" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-56957d9b4f-vvc4b_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:25.292113    3540 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "54d875a7d4c241270bb044e618d70d7563d89a1782c5fc4736aa2dcaf8d20c35"}
E0831 04:37:25.292113    3540 kuberuntime_manager.go:594] killPodWithSyncResult failed: failed to "KillPodSandbox" for "103ab3cc-ac8d-11e8-957b-000d3a2fb36b" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"omega-56957d9b4f-vvc4b_default\" network: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
E0831 04:37:25.293019    3540 pod_workers.go:186] Error syncing pod 103ab3cc-ac8d-11e8-957b-000d3a2fb36b ("omega-56957d9b4f-vvc4b_default(103ab3cc-ac8d-11e8-957b-000d3a2fb36b)"), skipping: failed to "KillPodSandbox" for "103ab3cc-ac8d-11e8-957b-000d3a2fb36b" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"omega-56957d9b4f-vvc4b_default\" network: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
2018/08/31 04:37:26 [cni] Timed out on locking store, err:Store is locked.
2018/08/31 04:37:26 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/08/31 04:37:26 Report plugin error
2018/08/31 04:37:26 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
2018/08/31 04:37:26 "Start Flag false CniSucceeded false Name CNI Version v1.0.10 ErrorMessage Store is locked vnet [] 
                Context AzureCNI SubContext "
2018/08/31 04:37:26 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/31 04:37:26 OSDetails &{windows    }
2018/08/31 04:37:26 SystemDetails &{0 0 0 0 0 0 }
2018/08/31 04:37:26 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.127 00:0d:3a:2f:bd:ac vEthernet (Ethernet 2) 30 0 }
2018/08/31 04:37:26 BridgeDetails <nil>
2018/08/31 04:37:26 Send telemetry success 200
2018/08/31 04:37:26 SetReportState succeeded
E0831 04:37:27.004305    3540 cni.go:280] Error deleting network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:27.004305    3540 remote_runtime.go:115] StopPodSandbox "5d7824902507684675fb387f26c64b5cfd8a8275bacf153b09a130676154ce0f" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-kdr8r_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0831 04:37:27.004305    3540 kuberuntime_gc.go:153] Failed to stop sandbox "5d7824902507684675fb387f26c64b5cfd8a8275bacf153b09a130676154ce0f" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-kdr8r_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

What you expected to happen:

No errors.

How to reproduce it (as minimally and precisely as possible):

We just have an hybrid cluster, with one Windows node. This consistantly happens each time we start a container. This problem appeared when we migrated to Windows Server 1803.

Anything else we need to know:

Two related issues that have been open for several months now. https://github.com/Azure/azure-container-networking/issues/145 https://github.com/Azure/azure-container-networking/issues/232

msuiche commented 6 years ago

Some additional logs from azure-vnet-ipam.log


2018/08/30 12:10:49 [cni-ipam] Processing ADD command with args {ContainerID:70869b140d13a70a2cab50257d68886602124e704ccb0b2c551f53652be30a68 Netns:default IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=omega-6758fb5498-pfxnk;K8S_POD_INFRA_CONTAINER_ID=70869b140d13a70a2cab50257d68886602124e704ccb0b2c551f53652be30a68 Path:c:\k\azurecni\bin}.
2018/08/30 12:10:49 [cni-ipam] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: Bridge:azure0 LogLevel: LogTarget: MultiTenancy:false EnableSnatOnHost:false Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet:10.240.0.0/12 Address: QueryInterval:} DNS:{Nameservers:[10.0.0.10 168.63.129.16] Domain: Search:[svc.cluster.local] Options:[]} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 44 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 48 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]}]}.
2018/08/30 12:10:49 [ipam] Starting source azure.
2018/08/30 12:10:49 [ipam] Refreshing address source.
2018/08/30 12:10:49 [ipam] Source refresh failed, err:expected element type <Interfaces> but have <html>.
2018/08/30 12:10:49 [ipam] Requesting address with address: options:map[].
2018/08/30 12:10:49 [ipam] Address request completed with address:10.240.0.146/12 err:<nil>.
2018/08/30 12:10:49 [ipam] Save succeeded.
2018/08/30 12:10:49 [cni-ipam] Allocated address 10.240.0.146/12.
2018/08/30 12:10:49 [cni-ipam] ADD command completed with result:IP:[{Version:4 Interface:<nil> Address:{IP:10.240.0.146 Mask:fff00000} Gateway:10.240.0.1}], Routes:[{Dst:{IP:0.0.0.0 Mask:00000000} GW:10.240.0.1}], DNS:{Nameservers:[168.63.129.16] Domain: Search:[] Options:[]} err:<nil>.
2018/08/30 12:10:49 [cni-ipam] Plugin stopped.

and azure-vnet.log that occured yesterday and may be related

2018/08/30 12:10:28 [cni-net] Plugin azure-vnet version v1.0.10.
2018/08/30 12:10:28 [cni-net] Running on windows
2018/08/30 12:10:28 [net] Network interface: {Index:8 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:2f:bd:ac Flags:up|broadcast|multicast} with IP addresses: [fe80::e59e:d65:cb33:6df7/64 10.240.0.127/12]
2018/08/30 12:10:28 [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]
2018/08/30 12:10:28 [net] Network interface: {Index:10 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:08:4a:d8 Flags:up|broadcast|multicast} with IP addresses: [fe80::7428:956b:359:c173/64 172.25.112.1/20]
2018/08/30 12:10:28 os.stat() for file azure-vnet.json failed with error CreateFile azure-vnet.json: The system cannot find the file specified.
2018/08/30 12:10:28 [net] Restored state, &{Version:v1.0.10 TimeStamp:0001-01-01 00:00:00 +0000 UTC ExternalInterfaces:map[] store:0xc0421326f0 Mutex:{state:0 sema:0}}
2018/08/30 12:10:28 [cni-net] Plugin started.
sharmasushant commented 6 years ago

@msuiche Can you please attach the complete log files azure-vnet.log and azure-vnet-ipam.log We would like to take a look at the time when you started seeing the file lock error. It seems like there is a corner case where the code is not cleaning up the lock file (most likely a failure case). Note that the log file with relevant time stamp may be suffixed with .1, .2, etc.

Thanks.

msuiche commented 6 years ago

Please find the requested file attached. Thanks. azure-vnet-ipam.log azure-vnet.log

PatrickLang commented 6 years ago

@sharmasushant do you need more info here?

I see this when multiple containers are starting at the same time. Have you considered adding a retry/backoff loop to avoid returning this on overlapped CNI calls?

PatrickLang commented 6 years ago

Also #194 may be same issue

msuiche commented 6 years ago

@PatrickLang Yes, this problem is definitely a regression we didn't have it before.

sharmasushant commented 6 years ago

@musiche the below logs that you pasted in the issue are not here in the attached files

2018/08/31 04:37:22 [cni] Timed out on locking store, err:Store is locked. 2018/08/31 04:37:22 Failed to initialize key-value store of network plugin, err:Store is locked. 2018/08/31 04:37:22 Report plugin error

Can you please share the azure-vnet.log.* files with more recent timestamps? I want to take a look at the logs when the issue started happening. The log file that you have attached is only upto 2018/08/30 12:10:28

msuiche commented 6 years ago

@sharmasushant The messages are from kubelet.err.log

cni.go:280 fails.

E0917 08:00:10.188905    3540 cni.go:280] Error deleting network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0917 08:00:10.188905    3540 remote_runtime.go:115] StopPodSandbox "515c41fd6c901e7323d1aa4b8feea46e05002ee95cf9a753d2e25045f85dbb73" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-hvxtq_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0917 08:00:10.188905    3540 kuberuntime_gc.go:153] Failed to stop sandbox "515c41fd6c901e7323d1aa4b8feea46e05002ee95cf9a753d2e25045f85dbb73" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-hvxtq_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
I0917 08:00:10.310902    3540 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "83903k8s9010"
2018/09/17 08:00:12 [cni] Timed out on locking store, err:Store is locked.
2018/09/17 08:00:12 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/09/17 08:00:12 Report plugin error

Some more information:



PS C:\k> .\azurecni\bin\azure-vnet.exe --version
2018/09/17 08:10:11 [cni] Timed out on locking store, err:Store is locked.
2018/09/17 08:10:11 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/09/17 08:10:11 Report plugin error
2018/09/17 08:10:11 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
2018/09/17 08:10:11 "Start Flag false CniSucceeded false Name CNI Version v1.0.10 ErrorMessage Store is locked vnet []
                                Context AzureCNI SubContext "
2018/09/17 08:10:11 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/09/17 08:10:11 OSDetails &{windows    }
2018/09/17 08:10:11 SystemDetails &{0 0 0 0 0 0 }
2018/09/17 08:10:11 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.127 00:0d:3a:2f:bd:ac vEthernet (Ethernet 2) 30 0 }
2018/09/17 08:10:11 BridgeDetails <nil>
2018/09/17 08:10:11 Send telemetry success 200
2018/09/17 08:10:11 SetReportState succeeded
PS C:\k> .\azurecni\bin\azure-vnet-ipam.exe --version
CNI_COMMAND env variable missing
v1.0.10
{
    "code": 100,
    "msg": "required env variables missing"
}
PS C:\k> docker version
Client:
 Version:      17.06.2-ee-16
 API version:  1.30
 Go version:   go1.8.7
 Git commit:   9ef4f0a
 Built:        Thu Jul 26 16:43:19 2018
 OS/Arch:      windows/amd64

Server:
 Engine:
  Version:      17.06.2-ee-16
  API version:  1.30 (minimum version 1.24)
  Go version:   go1.8.7
  Git commit:   9ef4f0a
  Built:        Thu Jul 26 16:52:17 2018
  OS/Arch:      windows/amd64
  Experimental: false

PS C:\k> docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
PS C:\k>
PatrickLang commented 6 years ago

@sharmasushant - do you need any more info on this failure? Is there any ETA for a fix?

msuiche commented 6 years ago

As far as I understand this is far to be an isolated issue as I've seen multiple issues with it. It's 100% reproducible on our side, we basically can't start a Windows container at all. I'm sure this impacts a lot of customers.

sharmasushant commented 6 years ago

@msuiche we can try to reproduce this on our side. The logs that you are attaching do not have the error. To unblock yourself, you can delete azure-vnet.json.lock file

sharmasushant commented 6 years ago

@sharmasushant - do you need any more info on this failure? Is there any ETA for a fix?

@PatrickLang We can do the change to add more retries/adjust timeout to accommodate multiple parallel POD deployments. However, this issue looks a little different that we have not yet repro'ed.

lowenna commented 6 years ago

I don't know a lot about CNI (well anything), but there's certainly a few issues in this repo reading the code. I've put together some fixes in https://github.com/jhowardmsft/azure-container-networking/commit/a3868463451abb1b041915d856758ef761950eb5 and asked @PatrickLang to verify that internally before I put a PR in here.

sharmasushant commented 6 years ago

@jhowardmsft PR are always welcome. Thanks for your effort on working on these!

lowenna commented 6 years ago

@sharmasushant How can I build this in WSL? I'm trying to get azure-vnet.exe and azure-vnet-ipam.exe for Patrick, but make all-binaries in WSL is only generating Linux variants, although does build azure-vnet-plugin.exe:

.
./linux_amd64
./linux_amd64/cni
./linux_amd64/cni/10-azure.conflist
./linux_amd64/cni/azure-vnet
./linux_amd64/cni/azure-vnet-cni-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./linux_amd64/cni/azure-vnet-ipam
./linux_amd64/cni-multitenancy
./linux_amd64/cni-multitenancy/10-azure.conflist
./linux_amd64/cni-multitenancy/azure-vnet
./linux_amd64/cni-multitenancy/azure-vnet-cni-multitenancy-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./linux_amd64/cni-multitenancy/azure-vnet-ipam
./linux_amd64/cnm
./linux_amd64/cnm/azure-vnet-cnm-linux-amd64-v1.0.12-rc3-8-g009ebee-dirty.tgz
./linux_amd64/cnm/azure-vnet-cnm-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./linux_amd64/cnm/azure-vnet-plugin
./linux_amd64/cns
./linux_amd64/cns/azure-cns
./linux_amd64/cns/azure-cns-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./linux_amd64/npm
./linux_amd64/npm/azure-npm
./linux_amd64/npm/azure-npm-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./windows_amd64
./windows_amd64/cnm
./windows_amd64/cnm/azure-vnet-plugin.exe
lowenna commented 6 years ago

NVM, figured it :)

lowenna commented 6 years ago

Submitted https://github.com/Azure/azure-container-networking/pull/247 to fix this.

msuiche commented 6 years ago

Awesome. Will try this ASAP.

msuiche commented 6 years ago

@sharmasushant tried deleting the file as you suggested and I'm bumping into the

https://github.com/kubernets/kubernetes/issues/66087 issue that @PatrickLang reported few weeks back.


E0924 08:06:31.859825    5104 helpers.go:735] eviction manager: failed to construct signal: "allocatableMemory.available" error: system container "pods" not found in metrics
W0924 08:06:31.859825    5104 helpers.go:808] eviction manager: no observation found for eviction signal nodefs.inodesFree
I0924 08:06:31.988710    5104 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "83903k8s9010"
E0924 08:06:36.342989    5104 perfcounter_nodestats.go:144] Unable to get cpu perf counter data; err: unable to collect data from counter. Error code is 800007d6
E0924 08:06:41.463036    5104 helpers.go:735] eviction manager: failed to construct signal: "allocatableMemory.available" error: system container "pods" not found in metrics
W0924 08:06:41.464036    5104 helpers.go:808] eviction manager: no observation found for eviction signal nodefs.inodesFree
I0924 08:06:41.602036    5104 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "83903k8s9010"

And because of https://github.com/kubernetes/kubernetes/issues/65713 E0924 08:06:22.156451 5104 kubelet_network.go:102] Failed to ensure that nat chain KUBE-MARK-DROP exists: error creating chain "KUBE-MARK-DROP": executable file not found in %PATH%:

My containers are still Pending :(

Does @jhowardmsft fix also includes fixes to those?

PatrickLang commented 6 years ago

@msuiche - both of those errors are just noise in the logs. They won't prevent the pod from starting. Both are fixed in v1.12.

What does kubectl describe pod … show for that pod? Any other errors in the log?

msuiche commented 6 years ago
Name:           omega-7fbbb7d79b-blh85
Namespace:      default
Node:           <none>
Labels:         pod-template-hash=3966638356
                service=omega
Annotations:    <none>
Status:         Pending
IP:             
Controlled By:  ReplicaSet/omega-7fbbb7d79b
Containers:
  omega:
    Image:     **********
    Port:      3000/TCP
    Liveness:  http-get http://:3000/healthz delay=60s timeout=1s period=30s #success=1 #failure=3
    Environment:
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=windows
Tolerations:     dedicated=windows:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  2m (x293631 over 6d)  default-scheduler  0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match node selector.
socotra:ergosphere msuiche$ 

@PatrickLang