Some additional logs from azure-vnet-ipam.log:
2018/08/30 12:10:49 [cni-ipam] Processing ADD command with args {ContainerID:70869b140d13a70a2cab50257d68886602124e704ccb0b2c551f53652be30a68 Netns:default IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=omega-6758fb5498-pfxnk;K8S_POD_INFRA_CONTAINER_ID=70869b140d13a70a2cab50257d68886602124e704ccb0b2c551f53652be30a68 Path:c:\k\azurecni\bin}.
2018/08/30 12:10:49 [cni-ipam] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: Bridge:azure0 LogLevel: LogTarget: MultiTenancy:false EnableSnatOnHost:false Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet:10.240.0.0/12 Address: QueryInterval:} DNS:{Nameservers:[10.0.0.10 168.63.129.16] Domain: Search:[svc.cluster.local] Options:[]} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 44 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 48 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]}]}.
2018/08/30 12:10:49 [ipam] Starting source azure.
2018/08/30 12:10:49 [ipam] Refreshing address source.
2018/08/30 12:10:49 [ipam] Source refresh failed, err:expected element type <Interfaces> but have <html>.
2018/08/30 12:10:49 [ipam] Requesting address with address: options:map[].
2018/08/30 12:10:49 [ipam] Address request completed with address:10.240.0.146/12 err:<nil>.
2018/08/30 12:10:49 [ipam] Save succeeded.
2018/08/30 12:10:49 [cni-ipam] Allocated address 10.240.0.146/12.
2018/08/30 12:10:49 [cni-ipam] ADD command completed with result:IP:[{Version:4 Interface:<nil> Address:{IP:10.240.0.146 Mask:fff00000} Gateway:10.240.0.1}], Routes:[{Dst:{IP:0.0.0.0 Mask:00000000} GW:10.240.0.1}], DNS:{Nameservers:[168.63.129.16] Domain: Search:[] Options:[]} err:<nil>.
2018/08/30 12:10:49 [cni-ipam] Plugin stopped.
and from azure-vnet.log, with entries from yesterday that may be related:
2018/08/30 12:10:28 [cni-net] Plugin azure-vnet version v1.0.10.
2018/08/30 12:10:28 [cni-net] Running on windows
2018/08/30 12:10:28 [net] Network interface: {Index:8 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:2f:bd:ac Flags:up|broadcast|multicast} with IP addresses: [fe80::e59e:d65:cb33:6df7/64 10.240.0.127/12]
2018/08/30 12:10:28 [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]
2018/08/30 12:10:28 [net] Network interface: {Index:10 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:08:4a:d8 Flags:up|broadcast|multicast} with IP addresses: [fe80::7428:956b:359:c173/64 172.25.112.1/20]
2018/08/30 12:10:28 os.stat() for file azure-vnet.json failed with error CreateFile azure-vnet.json: The system cannot find the file specified.
2018/08/30 12:10:28 [net] Restored state, &{Version:v1.0.10 TimeStamp:0001-01-01 00:00:00 +0000 UTC ExternalInterfaces:map[] store:0xc0421326f0 Mutex:{state:0 sema:0}}
2018/08/30 12:10:28 [cni-net] Plugin started.
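An aside on reading the configuration line in azure-vnet-ipam.log above: the two EndpointPolicy values are plain UTF-8 JSON printed as Go byte slices. A throwaway snippet, with the byte values copied verbatim from the log, decodes them:

package main

import "fmt"

func main() {
	// EndpointPolicy byte values copied from the [cni-ipam] log line above.
	outboundNAT := []byte{123, 34, 69, 120, 99, 101, 112, 116, 105, 111, 110, 76, 105, 115, 116, 34, 58, 91, 34, 49, 48, 46, 50, 52, 48, 46, 48, 46, 48, 47, 49, 50, 34, 44, 34, 49, 48, 46, 50, 52, 48, 46, 48, 46, 48, 47, 49, 50, 34, 93, 44, 34, 84, 121, 112, 101, 34, 58, 34, 79, 117, 116, 66, 111, 117, 110, 100, 78, 65, 84, 34, 125}
	route := []byte{123, 34, 68, 101, 115, 116, 105, 110, 97, 116, 105, 111, 110, 80, 114, 101, 102, 105, 120, 34, 58, 34, 49, 48, 46, 48, 46, 48, 46, 48, 47, 49, 54, 34, 44, 34, 78, 101, 101, 100, 69, 110, 99, 97, 112, 34, 58, 116, 114, 117, 101, 44, 34, 84, 121, 112, 101, 34, 58, 34, 82, 79, 85, 84, 69, 34, 125}

	fmt.Println(string(outboundNAT)) // {"ExceptionList":["10.240.0.0/12","10.240.0.0/12"],"Type":"OutBoundNAT"}
	fmt.Println(string(route))       // {"DestinationPrefix":"10.0.0.0/16","NeedEncap":true,"Type":"ROUTE"}
}

So the configuration carries an OutBoundNAT exception list for the pod CIDR and a ROUTE policy (with NeedEncap) for 10.0.0.0/16.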
@msuiche Can you please attach the complete log files azure-vnet.log and azure-vnet-ipam.log? We would like to take a look at the time when you started seeing the file lock error. It seems like there is a corner case where the code is not cleaning up the lock file (most likely a failure case). Note that the log files with the relevant timestamps may be suffixed with .1, .2, etc.
Thanks.
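To make the suspected corner case concrete, here is a minimal Go sketch (illustrative only, not the actual plugin code): if a lock file is created exclusively and only removed by a deferred unlock, a crash or kill before the defer runs leaves a stale file behind, and every later invocation fails with "Store is locked".

package main

import (
	"errors"
	"os"
	"time"
)

var errStoreLocked = errors.New("Store is locked")

// tryLock creates the lock file exclusively; an already-existing file,
// including a stale one left by a crashed run, maps to errStoreLocked.
func tryLock(path string) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0644)
	if err != nil {
		if os.IsExist(err) {
			return errStoreLocked
		}
		return err
	}
	return f.Close()
}

func unlock(path string) error { return os.Remove(path) }

func main() {
	const lockFile = "azure-vnet.json.lock"
	if err := tryLock(lockFile); err != nil {
		os.Exit(1) // subsequent runs land here once the file is stale
	}
	defer unlock(lockFile)  // never runs if the process dies mid-operation
	time.Sleep(time.Second) // stand-in for reading/writing azure-vnet.json
}

A common mitigation is to record the owner's PID in the lock file and treat the lock as stale when that process no longer exists.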
Please find the requested files attached. Thanks. azure-vnet-ipam.log azure-vnet.log
@sharmasushant do you need more info here?
I see this when multiple containers are starting at the same time. Have you considered adding a retry/backoff loop to avoid returning this on overlapped CNI calls?
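For concreteness, the suggested retry/backoff around lock acquisition could look like the sketch below, extending the earlier sketch (it reuses that hypothetical tryLock helper and the time import; the attempt count and delays are illustrative, not values from the plugin):

// lockWithRetry wraps tryLock in a bounded retry loop with exponential
// backoff, so overlapping CNI invocations wait briefly for the lock
// instead of failing immediately with "Store is locked".
func lockWithRetry(path string, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = tryLock(path); err == nil {
			return nil
		}
		time.Sleep(backoff) // give the competing invocation time to finish
		backoff *= 2
	}
	return err // still locked after all attempts
}

A short, growing delay between attempts lets whichever invocation holds the lock finish its read-modify-write before the others retry.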
Also, #194 may be the same issue.
@PatrickLang Yes, this problem is definitely a regression; we didn't have it before.
@msuiche the logs below that you pasted in the issue are not in the attached files:
2018/08/31 04:37:22 [cni] Timed out on locking store, err:Store is locked.
2018/08/31 04:37:22 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/08/31 04:37:22 Report plugin error
Can you please share the azure-vnet.log.* files with more recent timestamps? I want to take a look at the logs from when the issue started happening. The log file that you have attached only goes up to 2018/08/30 12:10:28.
@sharmasushant The messages are from kubelet.err.log, where cni.go:280 fails.
E0917 08:00:10.188905 3540 cni.go:280] Error deleting network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0917 08:00:10.188905 3540 remote_runtime.go:115] StopPodSandbox "515c41fd6c901e7323d1aa4b8feea46e05002ee95cf9a753d2e25045f85dbb73" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-hvxtq_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0917 08:00:10.188905 3540 kuberuntime_gc.go:153] Failed to stop sandbox "515c41fd6c901e7323d1aa4b8feea46e05002ee95cf9a753d2e25045f85dbb73" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "omega-75977c8875-hvxtq_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
I0917 08:00:10.310902 3540 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "83903k8s9010"
2018/09/17 08:00:12 [cni] Timed out on locking store, err:Store is locked.
2018/09/17 08:00:12 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/09/17 08:00:12 Report plugin error
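A note on how these two log excerpts fit together: kubelet's "error parsing its diagnostic message "": unexpected end of JSON input" is what you get when a CNI plugin exits without writing a JSON error object to stdout. The expected shape is the same {"code": ..., "msg": ...} document that azure-vnet-ipam.exe prints further below when invoked by hand. A hedged sketch of a conforming failure path, using the github.com/containernetworking/cni/pkg/types package:

package main

import (
	"errors"
	"os"

	"github.com/containernetworking/cni/pkg/types"
)

func main() {
	if err := run(); err != nil {
		// Print writes {"code":...,"msg":...} to stdout; exiting without
		// it leaves the runtime with an empty diagnostic message.
		e := &types.Error{Code: 100, Msg: err.Error()}
		_ = e.Print()
		os.Exit(1)
	}
}

// run stands in for the plugin's real work (locking the store and
// handling the CNI command); here it fails the way the logs show.
func run() error { return errors.New("Store is locked") }

So the empty-diagnostic errors in kubelet.err.log are consistent with the plugin bailing out on the store lock before it reaches its normal error reporting.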
Some more information:
PS C:\k> .\azurecni\bin\azure-vnet.exe --version
2018/09/17 08:10:11 [cni] Timed out on locking store, err:Store is locked.
2018/09/17 08:10:11 Failed to initialize key-value store of network plugin, err:Store is locked.
2018/09/17 08:10:11 Report plugin error
2018/09/17 08:10:11 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
2018/09/17 08:10:11 "Start Flag false CniSucceeded false Name CNI Version v1.0.10 ErrorMessage Store is locked vnet []
Context AzureCNI SubContext "
2018/09/17 08:10:11 OrchestratorDetails &{ kubectl command failed due to exit status 1}
2018/09/17 08:10:11 OSDetails &{windows }
2018/09/17 08:10:11 SystemDetails &{0 0 0 0 0 0 }
2018/09/17 08:10:11 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.127 00:0d:3a:2f:bd:ac vEthernet (Ethernet 2) 30 0 }
2018/09/17 08:10:11 BridgeDetails <nil>
2018/09/17 08:10:11 Send telemetry success 200
2018/09/17 08:10:11 SetReportState succeeded
PS C:\k> .\azurecni\bin\azure-vnet-ipam.exe --version
CNI_COMMAND env variable missing
v1.0.10
{
"code": 100,
"msg": "required env variables missing"
}
PS C:\k> docker version
Client:
Version: 17.06.2-ee-16
API version: 1.30
Go version: go1.8.7
Git commit: 9ef4f0a
Built: Thu Jul 26 16:43:19 2018
OS/Arch: windows/amd64
Server:
Engine:
Version: 17.06.2-ee-16
API version: 1.30 (minimum version 1.24)
Go version: go1.8.7
Git commit: 9ef4f0a
Built: Thu Jul 26 16:52:17 2018
OS/Arch: windows/amd64
Experimental: false
PS C:\k> docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
PS C:\k>
@sharmasushant - do you need any more info on this failure? Is there any ETA for a fix?
As far as I understand, this is far from an isolated issue; I've seen multiple issues about it. It's 100% reproducible on our side: we basically can't start a Windows container at all. I'm sure this impacts a lot of customers.
@msuiche we can try to reproduce this on our side. The logs that you attached do not contain the error. To unblock yourself, you can delete the azure-vnet.json.lock file.
@PatrickLang We can make the change to add more retries/adjust the timeout to accommodate multiple parallel pod deployments. However, this issue looks a little different, and we have not yet repro'ed it.
I don't know a lot about CNI (well, anything), but reading the code there are certainly a few issues in this repo. I've put together some fixes in https://github.com/jhowardmsft/azure-container-networking/commit/a3868463451abb1b041915d856758ef761950eb5 and asked @PatrickLang to verify them internally before I put a PR in here.
@jhowardmsft PRs are always welcome. Thanks for your effort in working on these!
@sharmasushant How can I build this in WSL? I'm trying to get azure-vnet.exe and azure-vnet-ipam.exe for Patrick, but make all-binaries in WSL only generates the Linux variants, although it does build azure-vnet-plugin.exe:
.
./linux_amd64
./linux_amd64/cni
./linux_amd64/cni/10-azure.conflist
./linux_amd64/cni/azure-vnet
./linux_amd64/cni/azure-vnet-cni-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./linux_amd64/cni/azure-vnet-ipam
./linux_amd64/cni-multitenancy
./linux_amd64/cni-multitenancy/10-azure.conflist
./linux_amd64/cni-multitenancy/azure-vnet
./linux_amd64/cni-multitenancy/azure-vnet-cni-multitenancy-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./linux_amd64/cni-multitenancy/azure-vnet-ipam
./linux_amd64/cnm
./linux_amd64/cnm/azure-vnet-cnm-linux-amd64-v1.0.12-rc3-8-g009ebee-dirty.tgz
./linux_amd64/cnm/azure-vnet-cnm-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./linux_amd64/cnm/azure-vnet-plugin
./linux_amd64/cns
./linux_amd64/cns/azure-cns
./linux_amd64/cns/azure-cns-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./linux_amd64/npm
./linux_amd64/npm/azure-npm
./linux_amd64/npm/azure-npm-linux-amd64-v1.0.12-rc3-9-ga386846.tgz
./windows_amd64
./windows_amd64/cnm
./windows_amd64/cnm/azure-vnet-plugin.exe
NVM, figured it out :)
Submitted https://github.com/Azure/azure-container-networking/pull/247 to fix this.
Awesome. Will try this ASAP.
@sharmasushant I tried deleting the file as you suggested, and now I'm bumping into the https://github.com/kubernetes/kubernetes/issues/66087 issue that @PatrickLang reported a few weeks back.
E0924 08:06:31.859825 5104 helpers.go:735] eviction manager: failed to construct signal: "allocatableMemory.available" error: system container "pods" not found in metrics
W0924 08:06:31.859825 5104 helpers.go:808] eviction manager: no observation found for eviction signal nodefs.inodesFree
I0924 08:06:31.988710 5104 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "83903k8s9010"
E0924 08:06:36.342989 5104 perfcounter_nodestats.go:144] Unable to get cpu perf counter data; err: unable to collect data from counter. Error code is 800007d6
E0924 08:06:41.463036 5104 helpers.go:735] eviction manager: failed to construct signal: "allocatableMemory.available" error: system container "pods" not found in metrics
W0924 08:06:41.464036 5104 helpers.go:808] eviction manager: no observation found for eviction signal nodefs.inodesFree
I0924 08:06:41.602036 5104 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "83903k8s9010"
And because of https://github.com/kubernetes/kubernetes/issues/65713:
E0924 08:06:22.156451 5104 kubelet_network.go:102] Failed to ensure that nat chain KUBE-MARK-DROP exists: error creating chain "KUBE-MARK-DROP": executable file not found in %PATH%:
My containers are still Pending :(
Does @jhowardmsft's fix also include fixes for those?
@msuiche - both of those errors are just noise in the logs. They won't prevent the pod from starting. Both are fixed in v1.12.
What does kubectl describe pod … show for that pod? Any other errors in the log?
Name: omega-7fbbb7d79b-blh85
Namespace: default
Node: <none>
Labels: pod-template-hash=3966638356
service=omega
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/omega-7fbbb7d79b
Containers:
omega:
Image: **********
Port: 3000/TCP
Liveness: http-get http://:3000/healthz delay=60s timeout=1s period=30s #success=1 #failure=3
Environment:
Conditions:
Type Status
PodScheduled False
Volumes:
QoS Class: BestEffort
Node-Selectors: beta.kubernetes.io/os=windows
Tolerations: dedicated=windows:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m (x293631 over 6d) default-scheduler 0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match node selector.
@PatrickLang
Is this a request for help?:
Yes.
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE.
Which release version?:
Plugin azure-vnet version v1.0.10.
Which component (CNI/IPAM/CNM/CNS):
CNI
Which Operating System (Linux/Windows):
Windows Server 1803
Which Orchestrator and version (e.g. Kubernetes, Docker)
What happened: Windows containers can't start.
What you expected to happen:
No errors.
How to reproduce it (as minimally and precisely as possible):
We have a hybrid cluster with one Windows node. This consistently happens each time we start a container. The problem appeared when we migrated to Windows Server 1803.
Anything else we need to know:
Two related issues have been open for several months now: https://github.com/Azure/azure-container-networking/issues/145 and https://github.com/Azure/azure-container-networking/issues/232