Nodes goes in Not Ready State after load testing #102

Closed bytesofdhiren closed 3 years ago

bytesofdhiren commented 6 years ago

I was load testing an AKS cluster. Many times, after firing heavy load at the cluster, the nodes go into the "Not Ready" state and never return to the "Ready" state.

What is the resolution to this problem? How can I bring the nodes back?

slack commented 6 years ago

A few questions:

1) What type of load testing were you running? Were you putting pressure on the cloud provider (adding/removing load balancers, provisioning/detaching disks)?
2) Do you have logs from the kubelets in your cluster? I'd be curious to see whether they logged any errors around connectivity to the apiserver.

bytesofdhiren commented 6 years ago

It was an intensive memory and CPU operation. After that, I added memory and CPU limits to all the pods, and the issue no longer reproduces. But in any case, once the pod is in the "Ready" state it should never go to the "NotReady" state.

I don't have logs from it, since it was unresponsive and I had to delete it.
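
For readers looking for what "adding a memory & CPU limit" looks like in practice, here is a minimal sketch in Python that builds a Pod manifest with requests and limits and prints it as JSON (which `kubectl apply -f` accepts). The function name, the pod name, the image, and all the quantities are hypothetical placeholders, not values taken from this thread; pick numbers that match your own workload.

```python
import json


def pod_with_limits(name, image, cpu_request="100m", cpu_limit="500m",
                    mem_request="128Mi", mem_limit="256Mi"):
    """Build a v1 Pod manifest whose single container declares requests and limits.

    All default quantities are illustrative assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "resources": {
                    # Requests are what the scheduler reserves on the node.
                    "requests": {"cpu": cpu_request, "memory": mem_request},
                    # Limits cap what the container may actually consume.
                    "limits": {"cpu": cpu_limit, "memory": mem_limit},
                },
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    print(json.dumps(pod_with_limits("webserver", "nginx:stable"), indent=2))
```

Saving the output to a file and running `kubectl apply -f` on it should create the pod; the same `resources` stanza goes under each container of a Deployment's pod template.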

marcel-dempers commented 6 years ago

I have this exact issue, and it is easy to reproduce. I run one AKS cluster (AKS 1) with a pod exposed on an external IP by a Kubernetes service, and a second AKS cluster (AKS 2) where I run JMeter. From JMeter on AKS 2, I send 1500 requests/sec to the service on AKS 1, and the nodes become NotReady:

kubectl get nodes
NAME                       STATUS           AGE       VERSION
aks-nodepool1-35008895-0   NotReady,agent   24d       v1.8.2
aks-nodepool1-35008895-1   NotReady,agent   24d       v1.8.2

When I use kubectl describe I see the following conditions on the node:

  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason                     Message
  ----                 ------    -----------------                 ------------------                ------                     -------
  NetworkUnavailable   False     Fri, 19 Jan 2018 12:53:00 +1100   Fri, 19 Jan 2018 12:53:00 +1100   RouteCreated               RouteController created a route
  OutOfDisk            False     Mon, 12 Feb 2018 17:49:42 +1100   Fri, 19 Jan 2018 12:52:35 +1100   KubeletHasSufficientDisk   kubelet has sufficient disk space available
  MemoryPressure       Unknown   Mon, 12 Feb 2018 17:49:42 +1100   Mon, 12 Feb 2018 17:50:40 +1100   NodeStatusUnknown          Kubelet stopped posting node status.
  DiskPressure         Unknown   Mon, 12 Feb 2018 17:49:42 +1100   Mon, 12 Feb 2018 17:50:40 +1100   NodeStatusUnknown          Kubelet stopped posting node status.
  Ready                Unknown   Mon, 12 Feb 2018 17:49:42 +1100   Mon, 12 Feb 2018 17:50:40 +1100   NodeStatusUnknown          Kubelet stopped posting node status.

I also am in the process of adding restrictions to deployment resources, but I just thought the cluster should still recover after such a scenario.
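
A minimal sketch, in case it helps others triage the same symptom: the function below takes the parsed output of `kubectl get nodes -o json` and lists every node whose `Ready` condition is not `True`, along with the reason and message seen in the tables in this thread. The function name is my own; the JSON field names are the standard Node status fields.

```python
import json


def not_ready_nodes(node_list):
    """Return (name, status, reason, message) for each node whose Ready
    condition is anything other than 'True'."""
    results = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None:
            # A node that reports no Ready condition at all is also suspect.
            results.append((name, "Missing", "", "no Ready condition reported"))
            continue
        status = ready.get("status", "Unknown")
        if status != "True":
            results.append((name, status,
                            ready.get("reason", ""), ready.get("message", "")))
    return results
```

One way to use it is to save the output of `kubectl get nodes -o json` to a file, load it with `json.load`, and pass the result in. A status of `Unknown` with reason `NodeStatusUnknown` means the kubelet stopped reporting, whereas `False` means the kubelet itself reported a problem (as in the PLEG message further down).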

I have added events json from kubectl cluster-info dump command.

Unable to get anything out of heapster

Error from server (BadRequest): the server rejected our request for an unknown reason (get pods heapster-75667786bb-rtl4r)

kube-system status:

kubectl get pods -n kube-system
NAME                                    READY     STATUS     RESTARTS   AGE
heapster-75667786bb-rtl4r               2/2       Unknown    6          24d
heapster-75667786bb-vkcs9               0/2       Pending    0          25m
kube-dns-v20-6c8f7f988b-ggm8w           0/3       Pending    0          25m
kube-dns-v20-6c8f7f988b-grflt           3/3       Unknown    9          24d
kube-dns-v20-6c8f7f988b-mzqbk           3/3       Unknown    9          24d
kube-dns-v20-6c8f7f988b-z2bch           0/3       Pending    0          25m
kube-proxy-hchrl                        1/1       NodeLost   3          24d
kube-proxy-xckjx                        1/1       NodeLost   3          24d
kube-svc-redirect-ht4bm                 1/1       NodeLost   44         24d
kube-svc-redirect-kvkv9                 1/1       NodeLost   45         24d
kubernetes-dashboard-6fc8cf9586-lvp8n   1/1       Unknown    47         24d
kubernetes-dashboard-6fc8cf9586-qdnh8   0/1       Pending    0          25m
tunnelfront-654c57cd9c-4x2zk            0/1       Pending    0          25m
tunnelfront-654c57cd9c-n7g87            1/1       Unknown    3          24d

Hope this info helps you guys troubleshoot further if needed.

dshimko commented 6 years ago

Same issue after overloading the node with more replicas than would fit in the available RAM.

  Type                  Status          LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ----                  ------          -----------------                       ------------------                      ------                          -------
  NetworkUnavailable    False           Fri, 19 Jan 2018 18:26:06 +0000         Fri, 19 Jan 2018 18:26:06 +0000         RouteCreated                    RouteController created a route
  OutOfDisk             False           Mon, 19 Feb 2018 21:34:46 +0000         Fri, 19 Jan 2018 18:25:43 +0000         KubeletHasSufficientDisk        kubelet has sufficient disk space available
  MemoryPressure        Unknown         Mon, 19 Feb 2018 21:34:46 +0000         Mon, 19 Feb 2018 21:35:39 +0000         NodeStatusUnknown               Kubelet stopped posting node status.
  DiskPressure          Unknown         Mon, 19 Feb 2018 21:34:46 +0000         Mon, 19 Feb 2018 21:35:39 +0000         NodeStatusUnknown               Kubelet stopped posting node status.
  Ready                 Unknown         Mon, 19 Feb 2018 21:34:46 +0000         Mon, 19 Feb 2018 21:35:39 +0000         NodeStatusUnknown               Kubelet stopped posting node status.

mooperd commented 6 years ago

One of my nodes seems to be hitting a similar problem.

meow@kubrick:~/dev/kube-system$ kubectl get nodes
NAME                       STATUS     ROLES     AGE       VERSION
aks-nodepool1-34207704-0   NotReady   agent     2d        v1.8.7
aks-nodepool1-34207704-1   Ready      agent     2d        v1.8.7
aks-nodepool1-34207704-2   Ready      agent     2d        v1.8.7
meow@kubrick:~/dev/kube-system$ kubectl describe node aks-nodepool1-34207704-0
Name:               aks-nodepool1-34207704-0
Roles:              agent
Labels:             agentpool=nodepool1
Taints:             <none>
CreationTimestamp:  Sun, 18 Mar 2018 13:13:30 +0100
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason                     Message
  ----                 ------    -----------------                 ------------------                ------                     -------
  NetworkUnavailable   False     Sun, 18 Mar 2018 13:15:06 +0100   Sun, 18 Mar 2018 13:15:06 +0100   RouteCreated               RouteController created a route
  OutOfDisk            False     Wed, 21 Mar 2018 12:16:48 +0100   Sun, 18 Mar 2018 13:13:30 +0100   KubeletHasSufficientDisk   kubelet has sufficient disk space available
  MemoryPressure       Unknown   Wed, 21 Mar 2018 12:16:48 +0100   Wed, 21 Mar 2018 12:17:29 +0100   NodeStatusUnknown          Kubelet stopped posting node status.
  DiskPressure         Unknown   Wed, 21 Mar 2018 12:16:48 +0100   Wed, 21 Mar 2018 12:17:29 +0100   NodeStatusUnknown          Kubelet stopped posting node status.
  Ready                Unknown   Wed, 21 Mar 2018 12:16:48 +0100   Wed, 21 Mar 2018 12:17:29 +0100   NodeStatusUnknown          Kubelet stopped posting node status.
  Hostname:    aks-nodepool1-34207704-0
Capacity:  0
 cpu:                             1
 memory:                          3501600Ki
 pods:                            110
Allocatable:  0
 cpu:                             1
 memory:                          3399200Ki
 pods:                            110
System Info:
 Machine ID:                 833e0926ee21aed71ec075d726cbcfe0
 System UUID:                8831AD2D-F08D-B646-BF5D-8BE8223630A4
 Boot ID:                    23445b11-3f5d-4a59-82d9-da2ef2ee25a6
 Kernel Version:             4.13.0-1007-azure
 OS Image:                   Debian GNU/Linux 8 (jessie)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://1.13.1
 Kubelet Version:            v1.8.7
 Kube-Proxy Version:         v1.8.7
ExternalID:                  2dad3188-8df0-46b6-bf5d-8be8223630a4
Non-terminated Pods:         (11 in total)
  Namespace                  Name                               CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                               ------------  ----------  ---------------  -------------
  es                         elasticsearch-logging-v1-6pj9k     100m (10%)    1 (100%)    0 (0%)           0 (0%)
  es                         kibana-logging-6c56bdff64-wlgnx    100m (10%)    100m (10%)  0 (0%)           0 (0%)
  kube-system                fluentd-rkbz8                      100m (10%)    0 (0%)      200Mi (6%)       200Mi (6%)
  kube-system                kube-dns-v20-5bf84586f4-6bpw8      110m (11%)    0 (0%)      120Mi (3%)       220Mi (6%)
  kube-system                kube-dns-v20-5bf84586f4-fxhd8      110m (11%)    0 (0%)      120Mi (3%)       220Mi (6%)
  kube-system                kube-proxy-fttkb                   100m (10%)    0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-svc-redirect-fhhkc            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-app-two               test-app-two-659cb68964-wchcc      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-app                   test-app-68f4cc5d94-7gk8h          0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-fecore                test-fecore-5647458597-s54lw       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-fecore                test-fecore-857d988ff9-9sxl9       0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits    Memory Requests  Memory Limits
  ------------  ----------    ---------------  -------------
  620m (62%)    1100m (110%)  440Mi (13%)      640Mi (19%)
  Type     Reason                            Age                  From                               Message
  ----     ------                            ----                 ----                               -------
  Warning  FailedNodeAllocatableEnforcement  40m (x4264 over 2d)  kubelet, aks-nodepool1-34207704-0  Failed to update Node Allocatable Limits "": failed to set supported cgroup subsystems for cgroup : Failed to set config for supported subsystems : failed to write 3585638400 to memory.limit_in_bytes: write /var/lib/docker/overlay2/463cfcf6aa43fd385982d198b7bf929b52b7168494235c87153516bffcfebc38/merged/sys/fs/cgroup/memory/memory.limit_in_bytes: invalid argument

rfum commented 6 years ago

I'm having the same error with my nodes.

What I did:

Here are some logs about my issue:

$ kubectl get nodes
NAME                       STATUS     ROLES     AGE       VERSION
aks-agentpool-23876029-0   NotReady   agent     15h       v1.8.7
aks-agentpool-23876029-1   NotReady   agent     15h       v1.8.7
aks-agentpool-23876029-2   NotReady   agent     15h       v1.8.7

Here is the description of one of my nodes:

$ kubectl describe node aks-agentpool-23876029-0
Name:               aks-agentpool-23876029-0
Roles:              agent
Labels:             agentpool=agentpool
Taints:             <none>
CreationTimestamp:  Sat, 07 Apr 2018 01:44:09 +0300
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Sat, 07 Apr 2018 01:44:22 +0300   Sat, 07 Apr 2018 01:44:22 +0300   RouteCreated                 RouteController created a route
  OutOfDisk            False   Sat, 07 Apr 2018 17:07:35 +0300   Sat, 07 Apr 2018 01:44:09 +0300   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Sat, 07 Apr 2018 17:07:35 +0300   Sat, 07 Apr 2018 17:07:35 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sat, 07 Apr 2018 17:07:37 +0300   Sat, 07 Apr 2018 17:07:37 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready                False   Sat, 07 Apr 2018 17:07:37 +0300   Sat, 07 Apr 2018 17:07:37 +0300   KubeletNotReady              container runtime is down,PLEG is not healthy: pleg was last seen active 4m12.549011207s ago; threshold is 3m0s
  Hostname:    aks-agentpool-23876029-0
Capacity:  0
 cpu:                             1
 memory:                          921108Ki
 pods:                            110
Allocatable:  0
 cpu:                             1
 memory:                          818708Ki
 pods:                            110
System Info:
 Machine ID:                 2c32d3297bfd44c5a577c9f5a562fb1d
 System UUID:                4875273C-34B3-1A42-8A86-EF22E3124ED4
 Boot ID:                    e8b22273-9de7-464e-83ad-6cc6c35da6c2
 Kernel Version:             4.13.0-1011-azure
 OS Image:                   Debian GNU/Linux 8 (jessie)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://1.13.1
 Kubelet Version:            v1.8.7
 Kube-Proxy Version:         v1.8.7
ExternalID:                  3c277548-b334-421a-8a86-ef22e3124ed4
Non-terminated Pods:         (18 in total)
  Namespace                  Name                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                     ------------  ----------  ---------------  -------------
  default                    webserver-5d6cdf9d96-25dvc               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-869sm               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-brtb9               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-bvpvt               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-cllt8               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-fgk4j               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-kc2km               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-msjds               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-swgf2               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-5d6cdf9d96-z7p54               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    webserver-6d656d4d54-drg8j               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                heapster-75f8df9884-jgp9k                138m (13%)    138m (13%)  294Mi (36%)      294Mi (36%)
  kube-system                kube-dns-v20-5bf84586f4-4m4xp            110m (11%)    0 (0%)      120Mi (15%)      220Mi (27%)
  kube-system                kube-dns-v20-5bf84586f4-z2nr4            110m (11%)    0 (0%)      120Mi (15%)      220Mi (27%)
  kube-system                kube-proxy-ntfw9                         100m (10%)    0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-svc-redirect-ghnxw                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                kubernetes-dashboard-665f768455-pvql2    100m (10%)    100m (10%)  50Mi (6%)        50Mi (6%)
  kube-system                tunnelfront-88b6d8ddc-stw24              0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  558m (55%)    238m (23%)  584Mi (73%)      784Mi (98%)
  Type    Reason                   Age                 From                               Message
  ----    ------                   ----                ----                               -------
  Normal  NodeNotReady             38m (x2 over 42m)   kubelet, aks-agentpool-23876029-0  Node aks-agentpool-23876029-0 status is now: NodeNotReady
  Normal  NodeHasSufficientMemory  35m (x23 over 15h)  kubelet, aks-agentpool-23876029-0  Node aks-agentpool-23876029-0 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    35m (x23 over 15h)  kubelet, aks-agentpool-23876029-0  Node aks-agentpool-23876029-0 status is now: NodeHasNoDiskPressure
  Normal  NodeReady                35m (x6 over 15h)   kubelet, aks-agentpool-23876029-0  Node aks-agentpool-23876029-0 status is now: NodeReady
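
A worked check of the "Allocated resources" figures above, as a hedged sketch: the node has 818708Ki allocatable memory, memory limits total 784Mi, and the eleven webserver pods declare no requests or limits at all. The helper names below are hypothetical, only the Ki/Mi/Gi suffixes are handled, and percentages are rounded down, which is an assumption that happens to match the values shown.

```python
# Binary suffixes only; other Kubernetes quantity formats are not handled here.
_UNITS_KIB = {"Ki": 1, "Mi": 1024, "Gi": 1024 * 1024}


def to_kib(quantity):
    """Convert a quantity such as '784Mi' or '818708Ki' to KiB."""
    for suffix, factor in _UNITS_KIB.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    raise ValueError("unsupported quantity: %r" % quantity)


def percent_of_allocatable(total, allocatable):
    """Integer percentage of allocatable consumed by total (rounded down)."""
    return to_kib(total) * 100 // to_kib(allocatable)


def headroom_kib(total, allocatable):
    """KiB left over once total is subtracted from allocatable."""
    return to_kib(allocatable) - to_kib(total)


if __name__ == "__main__":
    print(percent_of_allocatable("584Mi", "818708Ki"))  # memory requests
    print(percent_of_allocatable("784Mi", "818708Ki"))  # memory limits
    print(headroom_kib("784Mi", "818708Ki"))            # KiB not covered by any limit
```

With limits already at 98% of allocatable, roughly 15 MiB is left for every pod that declares nothing, so a burst from the unconstrained webserver pods can plausibly starve the kubelet and the container runtime.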

NIC logs:

$ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:0d:3a:1e:00:84
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::20d:3aff:fe1e:84/64 Scope:Link
          RX packets:2351202 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1452121 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2044348334 (2.0 GB)  TX bytes:257285038 (257.2 MB)

journalctl records for the kubelet service:

$ journalctl -u kubelet
Hint: You are currently not seeing messages from other users and the system.
      Users in the 'systemd-journal' group can see all messages. Pass -q to
      turn off this notice.
-- No entries --

and a snapshot of the node's resource usage: [screenshot, 2018-04-07 at 5:21:39 pm]

seanknox commented 5 years ago

Closing due to inactivity. Feel free to re-open if this is still an issue.

akolodkin commented 5 years ago

We are experiencing the same behavior: the cluster is losing nodes under load, especially with a 1-core setup (DS1 VM).

DenisBiondic commented 5 years ago

Same here: multiple nodes in Not Ready status, presumably because we don't have any resource quotas on pods yet.

kubectl describe node shows:

  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----             ------    -----------------                 ------------------                ------                    -------
  OutOfDisk        Unknown   Tue, 07 Aug 2018 20:05:51 +0200   Tue, 07 Aug 2018 20:06:35 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure   Unknown   Tue, 07 Aug 2018 20:05:51 +0200   Tue, 07 Aug 2018 20:06:35 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 07 Aug 2018 20:05:51 +0200   Tue, 07 Aug 2018 20:06:35 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure      False     Tue, 07 Aug 2018 20:05:51 +0200   Fri, 03 Aug 2018 00:14:57 +0200   KubeletHasSufficientPID   kubelet has sufficient PID available
  Ready            Unknown   Tue, 07 Aug 2018 20:05:51 +0200   Tue, 07 Aug 2018 20:06:35 +0200   NodeStatusUnknown         Kubelet stopped posting node status.

Nodes themselves:

aks-nodepool1-37134528-0   NotReady   agent   4d   v1.10.6
aks-nodepool1-37134528-1   NotReady   agent   4d   v1.10.6
aks-nodepool1-37134528-2   NotReady   agent   4d   v1.10.6

What is spooky is that the nodes go "Not Ready" at exactly 20:05 every evening and are back in the Ready state at 08:00 in the morning.

mappindrones commented 5 years ago

Experiencing a similar issue in a cluster built through acs-engine: agent v1.10.1, Ubuntu 16.04.4 LTS.

Experiencing all manner of instability - 504s, containers losing connectivity to db, random container restarts. We've applied the azure-cni-networkmonitor daemonset "patch" but still experiencing a high level of networking issues.

Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.088277] IPv6: ADDRCONF(NETDEV_UP): azveth592e4ac-2: link is not ready
Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.088302] IPv6: ADDRCONF(NETDEV_CHANGE): azveth592e4ac-2: link becomes ready
Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.088359] IPv6: ADDRCONF(NETDEV_CHANGE): azveth592e4ac: link becomes ready
Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.088375] azure0: port 8(azveth592e4ac) entered blocking state
Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.088377] azure0: port 8(azveth592e4ac) entered forwarding state
Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.088954] azure0: port 8(azveth592e4ac) entered disabled state
Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.089055] eth0: renamed from azveth592e4ac-2
Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.112188] azure0: port 8(azveth592e4ac) entered blocking state
Sep 18 06:32:07 k8s-backup-37692245-8 kernel: [25553.112191] azure0: port 8(azveth592e4ac) entered forwarding state
Sep 18 06:32:43 k8s-backup-37692245-8 kernel: [25589.545719] azure0: port 8(azveth592e4ac) entered disabled state
Sep 18 06:32:43 k8s-backup-37692245-8 kernel: [25589.546158] device azveth592e4ac left promiscuous mode
Sep 18 06:32:43 k8s-backup-37692245-8 kernel: [25589.546179] azure0: port 8(azveth592e4ac) entered disabled state
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.411464] IPv6: ADDRCONF(NETDEV_UP): azveth161ae8b: link is not ready
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.411977] azure0: port 8(azveth161ae8b) entered blocking state
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.411978] azure0: port 8(azveth161ae8b) entered disabled state
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.412104] device azveth161ae8b entered promiscuous mode
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.452272] IPv6: ADDRCONF(NETDEV_UP): azveth161ae8b-2: link is not ready
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.452281] IPv6: ADDRCONF(NETDEV_CHANGE): azveth161ae8b-2: link becomes ready
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.452370] IPv6: ADDRCONF(NETDEV_CHANGE): azveth161ae8b: link becomes ready
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.452384] azure0: port 8(azveth161ae8b) entered blocking state
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.452387] azure0: port 8(azveth161ae8b) entered forwarding state
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.453055] azure0: port 8(azveth161ae8b) entered disabled state
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.453182] eth0: renamed from azveth161ae8b-2
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.500179] azure0: port 8(azveth161ae8b) entered blocking state
Sep 18 06:50:05 k8s-backup-37692245-8 kernel: [26631.500182] azure0: port 8(azveth161ae8b) entered forwarding state
Sep 18 06:50:44 k8s-backup-37692245-8 kernel: [26669.830613] azure0: port 8(azveth161ae8b) entered disabled state
Sep 18 06:50:44 k8s-backup-37692245-8 kernel: [26669.831078] device azveth161ae8b left promiscuous mode
Sep 18 06:50:44 k8s-backup-37692245-8 kernel: [26669.831116] azure0: port 8(azveth161ae8b) entered disabled state
Sep 18 06:54:08 k8s-backup-37692245-8 kernel: [26873.581141] IPv6: ADDRCONF(NETDEV_UP): azvethdf7a741: link is not ready
Sep 18 06:54:08 k8s-backup-37692245-8 kernel: [26873.581615] azure0: port 8(azvethdf7a741) entered blocking state
Sep 18 06:54:08 k8s-backup-37692245-8 kernel: [26873.581616] azure0: port 8(azvethdf7a741) entered disabled state
Sep 18 06:54:08 k8s-backup-37692245-8 kernel: [26873.581678] device azvethdf7a741 entered promiscuous mode

ifconfig shows packets being dropped:

azure0    Link encap:Ethernet  HWaddr 00:0d:3a:06:b7:a1
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::20d:3aff:fe06:b7a1/64 Scope:Link
          RX packets:1396090 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1490274 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5380455646 (5.3 GB)  TX bytes:1404949910 (1.4 GB)

azveth2396295 Link encap:Ethernet  HWaddr 62:e2:52:71:ee:3d
          inet6 addr: fe80::60e2:52ff:fe71:ee3d/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:4567643 errors:0 dropped:19276 overruns:0 frame:0
          TX packets:2407275 errors:0 dropped:11 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2123881352 (2.1 GB)  TX bytes:2065597278 (2.0 GB)

azveth21783c1 Link encap:Ethernet  HWaddr 6a:58:58:71:62:86
          inet6 addr: fe80::6858:58ff:fe71:6286/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:1732555 errors:0 dropped:2 overruns:0 frame:0
          TX packets:1417395 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17552342445 (17.5 GB)  TX bytes:4175873468 (4.1 GB)
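
To put the dropped counters above in proportion, here is a small sketch that parses this older net-tools `ifconfig` layout and computes a drop rate per interface. The function names are my own, and the regular expressions assume exactly the `RX packets:N errors:N dropped:N` format pasted above; newer `ifconfig` builds print these counters differently.

```python
import re

# Matches the first line of an interface stanza, e.g. "azure0    Link encap:Ethernet ...".
_IFACE = re.compile(r"^(\S+)\s+Link encap:")
# Matches the indented RX/TX counter lines of the old net-tools format.
_COUNTERS = re.compile(r"^\s+(RX|TX) packets:(\d+) errors:(\d+) dropped:(\d+)")


def drop_stats(ifconfig_text):
    """Return {interface: {"RX": (packets, dropped), "TX": (packets, dropped)}}."""
    stats = {}
    current = None
    for line in ifconfig_text.splitlines():
        header = _IFACE.match(line)
        if header:
            current = header.group(1)
            stats[current] = {}
            continue
        counters = _COUNTERS.match(line)
        if counters and current is not None:
            direction, packets, _errors, dropped = counters.groups()
            stats[current][direction] = (int(packets), int(dropped))
    return stats


def drop_rate(packets, dropped):
    """Fraction of packets dropped; 0.0 when no packets were seen."""
    return dropped / packets if packets else 0.0
```

For `azveth2396295` this gives 19276 drops out of 4567643 received packets, roughly 0.4%, while `azure0` itself shows no drops. Whether that level explains the 504s is not something the counters alone can establish.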

jnoller commented 5 years ago

PLEG Unhealthy is a known defect in Kubernetes upstream with patches looking like they will land in k8s 1.16:

mingtwan-zz commented 4 years ago

We are experiencing the same issue. It happened when I deployed a StatefulSet that contains some PVCs, so it seems disk provisioning caused the problem. The node's "Resource health" page says "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk".

mooperd commented 4 years ago

I'm fairly convinced this problem is due to disk performance getting so slow that the nodes can't write logs fast enough.

jnoller commented 4 years ago

@mooperd you're probably right. If you have a cluster with a 1 TB OS disk, which I think is a P4-class premium disk with a maximum IOPS of 200, then disk IO contention can push the OS disk IO high enough for this to occur.

mooperd commented 4 years ago

I think it's more a problem of the underlying infrastructure. I have seen disk r/w down to 10Kb/s on Azure instances.

Lower standard IOPS on nodes wouldn't necessarily be a problem as the writing wouldn't be synchronous - the operating system would stream them to disk.

Anyone seeing this issue should have a look at top on all the nodes (not just the one affected) and see if any of the processes are spending a lot of time in 'wait'.
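
As a sketch of the same check without watching `top` by hand: the aggregate `cpu` line in `/proc/stat` on Linux exposes cumulative iowait time as its fifth counter, so two samples taken a few seconds apart give the share of CPU time spent waiting on IO. The function names are hypothetical.

```python
def read_cpu_times(path="/proc/stat"):
    """Return the aggregate 'cpu' counters from /proc/stat as a list of ints.

    Field order: user, nice, system, idle, iowait, irq, softirq, steal, ...
    """
    with open(path) as stat_file:
        for line in stat_file:
            if line.startswith("cpu "):
                return [int(value) for value in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line found in %s" % path)


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum user..steal only; guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total > 0 else 0.0
```

Call `read_cpu_times()` twice with a short sleep in between and pass both samples to `iowait_percent`. A persistently high value on several nodes would support the slow-disk theory, though as noted in the reply below this one, throttling of the OS disk may not show up this way.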

jnoller commented 4 years ago

@mooperd I've been debugging this for the service. You're right in most normal IO cases, and top will not show the underlying throttling of the OS disk. Go back to my example:

Node, 1tb OS disk, running linux.

A 1 TB disk on that page has a maximum IO throughput, but it also has a maximum IOPS: 5000 max IOPS per disk, and that is your OS disk.

Now, factor in the size of the containers: larger Docker containers have worse disk IO transactions. The Azure system treats anything doing 256 KiB of IO as an IOP.
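
To make that arithmetic concrete, a rough sketch follows. The 256 KiB unit and the 5000 IOPS cap are the figures quoted in this comment; the assumption that a request is charged one IO per started 256 KiB is mine and is not spelled out in this thread, so treat the model as illustrative only.

```python
IO_UNIT_BYTES = 256 * 1024  # IO unit size quoted in the comment above
MAX_IOPS = 5000             # per-disk cap quoted in the comment above


def io_units(request_bytes):
    """IOs charged for one request, ASSUMING one IO per started 256 KiB."""
    return max(1, (request_bytes + IO_UNIT_BYTES - 1) // IO_UNIT_BYTES)


def max_requests_per_second(request_bytes, iops_cap=MAX_IOPS):
    """Upper bound on requests per second of a given size under the cap."""
    return iops_cap // io_units(request_bytes)


if __name__ == "__main__":
    for size in (4 * 1024, 256 * 1024, 1024 * 1024):
        print(size, io_units(size), max_requests_per_second(size))
```

Under this model a 1 MiB write costs four IOs, so the same cap allows only 1250 such writes per second, and that budget is shared by everything on the OS disk.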

Now on the OS disk you have the docker daemon, kubelet, in memory FS drivers (say cifs, etc).

Looking at kube-metrics data only shows the in-memory kube object view and not the OS/Docker level, which means it misses the system-level IO calls.

This means in addition to the normal VM limitations - you also have the cache limit / max etc:

[image: disk sizes]

jnoller commented 4 years ago

Please also see this issue for intermittent NodeNotReady, DNS latency and other crashes related to system load: #1373

ghost commented 3 years ago

palma21 commented 3 years ago

Sorry about the spam. The bot issue should be fixed now.

A lot of the issues on this ticket have been solved or mitigated upstream or in recent versions of AKS.

We realize perhaps not all problems have been addressed since the thread is running fairly long. If you can kindly open a ticket with your specific issue we can look into it.

Please also refer to recent features that will add increased stability and resilience:

1) 2) 3) 4) K-node which can be provided by the platform by support request and will be exposed on the API soon: