awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Node becomes NotReady #79

Closed agcooke closed 5 years ago

agcooke commented 6 years ago

We are running EKS in Ireland and our nodes are going unhealthy regularly.

It is not possible to SSH to the host, and pods are not reachable. We have experienced this with t2.xlarge, t2.small, and t3.medium instances.

We could SSH to another node in the cluster and ping the NotReady node, but could not SSH to it either.

Graphs show that memory usage spikes at about the same time as the journalctl entries below. EBS IO also goes high. The exact time is hard to pinpoint; I added logs with interesting 'failures' around the time that we think the node disappeared.

We are using the cluster for running tests, so pods are getting created and destroyed often.

We have not done anything described in https://github.com/awslabs/amazon-eks-ami/issues/51 for log rotation.

Cluster information:
CNI: latest daemonset with image 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.2.1
Region: eu-west-1

LOGS

Node AMI

AMI ID amazon-eks-node-v24 (ami-0c7a4976cb6fafd3a)

File system

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1,9G     0  1,9G   0% /dev
tmpfs           1,9G     0  1,9G   0% /dev/shm
tmpfs           1,9G  2,3M  1,9G   1% /run
tmpfs           1,9G     0  1,9G   0% /sys/fs/cgroup
/dev/nvme0n1p1   64G   40G   25G  62% /
tmpfs           389M     0  389M   0% /run/user/1000

kubectl describe node

Name: ip-<secret>.eu-west-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-west-1
failure-domain.beta.kubernetes.io/zone=eu-west-1b
kubernetes.io/hostname=ip-<secret>.eu-west-1.compute.internal
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 30 Oct 2018 11:25:48 +0100
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Wed, 31 Oct 2018 10:56:53 +0100 Tue, 30 Oct 2018 11:25:46 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: <secret>
Hostname: ip-<secret>.eu-west-1.compute.internal
Capacity:
cpu: 2
ephemeral-storage: 67096556Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3980344Ki
pods: 17
Allocatable:
cpu: 2
ephemeral-storage: 61836185908
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3877944Ki
pods: 17
System Info:
Machine ID: asdf
System UUID: asdf
Boot ID: asdf
Kernel Version: 4.14.62-70.117.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.6.2
Kubelet Version: v1.10.3
Kube-Proxy Version: v1.10.3
ProviderID: aws:///eu-west-1b/i-<secret>
Non-terminated Pods: (14 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system aws-node-hshhg 10m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-fkrb8 100m (5%) 0 (0%) 0 (0%) 0 (0%)
monitoring datadog-datadog-bk5bd 200m (10%) 200m (10%) 256Mi (6%) 256Mi (6%)
monitoring prometheus-node-exporter-4z2dg 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t1 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t2 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t3 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t4 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t5 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t6 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t7 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t8 100m (5%) 0 (0%) 256Mi (6%) 0 (0%)
t9 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t10 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1160m (57%) 950m (47%)
memory 2012Mi (53%) 1756Mi (46%)
Events: <none>

journalctl logs around the time

okt 31 10:01:29 ip-<secret>.eu-west-1.compute.internal kernel: aws-k8s-agent: page allocation stalls for 10404ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
okt 31 10:01:30 ip-<secret>.eu-west-1.compute.internal kernel: aws-k8s-agent cpuset=1ef2c300b3981b045f3f2fcab050f674afead7e7c828362ec2d40ef82bf02441 mems_allowed=0
okt 31 10:01:31 ip-<secret>.eu-west-1.compute.internal kernel: CPU: 1 PID: 6267 Comm: aws-k8s-agent Not tainted 4.14.62-70.117.amzn2.x86_64 #1
okt 31 10:01:34 ip-<secret>.eu-west-1.compute.internal kernel: Hardware name: Amazon EC2 t3.medium/, BIOS 1.0 10/16/2017
okt 31 10:01:36 ip-<secret>.eu-west-1.compute.internal kernel: Call Trace:
okt 31 10:01:38 ip-<secret>.eu-west-1.compute.internal kernel: dump_stack+0x5c/0x82
okt 31 10:01:39 ip-<secret>.eu-west-1.compute.internal kernel: warn_alloc+0x114/0x1c0
okt 31 10:01:41 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_slowpath+0x831/0xe00
okt 31 10:01:42 ip-<secret>.eu-west-1.compute.internal kernel: ? get_page_from_freelist+0x371/0xba0
okt 31 10:01:45 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_nodemask+0x227/0x250
okt 31 10:01:46 ip-<secret>.eu-west-1.compute.internal kernel: filemap_fault+0x204/0x5f0
okt 31 10:01:47 ip-<secret>.eu-west-1.compute.internal kernel: __xfs_filemap_fault.constprop.8+0x49/0x120 [xfs]
okt 31 10:01:50 ip-<secret>.eu-west-1.compute.internal kernel: __do_fault+0x20/0x60
okt 31 10:01:52 ip-<secret>.eu-west-1.compute.internal kernel: handle_pte_fault+0x945/0xeb0
okt 31 10:01:55 ip-<secret>.eu-west-1.compute.internal kernel: __handle_mm_fault+0x431/0x540
okt 31 10:01:57 ip-<secret>.eu-west-1.compute.internal kernel: handle_mm_fault+0xaa/0x1e0
okt 31 10:02:00 ip-<secret>.eu-west-1.compute.internal kernel: __do_page_fault+0x23e/0x4c0
okt 31 10:02:02 ip-<secret>.eu-west-1.compute.internal kernel: ? async_page_fault+0x2f/0x50
okt 31 10:02:07 ip-<secret>.eu-west-1.compute.internal kernel: async_page_fault+0x45/0x50
okt 31 10:02:09 ip-<secret>.eu-west-1.compute.internal kernel: RIP: 0001:0x1f
okt 31 10:02:12 ip-<secret>.eu-west-1.compute.internal kernel: RSP: 0000:000000c420170f58 EFLAGS: 4d32dce245d7
okt 31 10:02:15 ip-<secret>.eu-west-1.compute.internal kernel: Mem-Info:
okt 31 10:02:16 ip-<secret>.eu-west-1.compute.internal kernel: active_anon:895836 inactive_anon:8314 isolated_anon:0
active_file:413 inactive_file:596 isolated_file:0
unevictable:0 dirty:1 writeback:0 unstable:0
slab_reclaimable:17241 slab_unreclaimable:26888
mapped:22510 shmem:28069 pagetables:7173 bounce:0
free:21650 free_pcp:12 free_cma:0
okt 31 10:02:17 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 active_anon:3583344kB inactive_anon:33256kB active_file:1652kB inactive_file:2384kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:90040kB dirty:4kB writeback:0kB shmem:112276kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 16384kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
okt 31 10:02:19 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA free:15620kB min:268kB low:332kB high:396kB active_anon:288kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
okt 31 10:02:22 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 2951 3849 3849
okt 31 10:02:24 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA32 free:54972kB min:51600kB low:64500kB high:77400kB active_anon:2799324kB inactive_anon:31696kB active_file:556kB inactive_file:816kB unevictable:0kB writepending:0kB present:3129320kB managed:3044324kB mlocked:0kB kernel_stack:7968kB pagetables:19844kB bounce:0kB free_pcp:148kB local_pcp:0kB free_cma:0kB
okt 31 10:02:27 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 0 898 898
okt 31 10:02:30 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 Normal free:15408kB min:15708kB low:19632kB high:23556kB active_anon:783732kB inactive_anon:1560kB active_file:884kB inactive_file:1392kB unevictable:0kB writepending:4kB present:987136kB managed:920112kB mlocked:0kB kernel_stack:4304kB pagetables:8848kB bounce:0kB free_pcp:420kB local_pcp:0kB free_cma:0kB
okt 31 10:02:32 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 0 0 0
okt 31 10:02:34 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 2*32kB (UM) 3*64kB (UM) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (ME) = 15620kB
okt 31 10:02:36 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA32: 1659*4kB (UME) 1496*8kB (UME) 1181*16kB (UME) 446*32kB (UME) 54*64kB (UME) 1*128kB (E) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55356kB
okt 31 10:02:37 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 Normal: 334*4kB (UMEH) 351*8kB (UMEH) 431*16kB (UMEH) 93*32kB (UMEH) 4*64kB (H) 2*128kB (H) 4*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15552kB
okt 31 10:02:38 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
okt 31 10:02:38 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
okt 31 10:02:40 ip-<secret>.eu-west-1.compute.internal kernel: 28769 total pagecache pages
okt 31 10:02:40 ip-<secret>.eu-west-1.compute.internal kernel: 0 pages in swap cache
okt 31 10:02:42 ip-<secret>.eu-west-1.compute.internal kernel: Swap cache stats: add 0, delete 0, find 0/0
okt 31 10:02:43 ip-<secret>.eu-west-1.compute.internal kernel: Free swap = 0kB
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal systemd-journal[26209]: Permanent journal is using 392.0M (max allowed 4.0G, trying to leave 4.0G free of 38.6G available → current limit 4.0G).
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal kernel: Total swap = 0kB
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal kernel: 1033112 pages RAM
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: 0 pages HighMem/MovableOnly
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: 38026 pages reserved
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: process-agent: page allocation stalls for 10580ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: process-agent cpuset=67b33ad9edc4663ce3e97ac968df4726a9beeff073706349383b1e9eabd93125 mems_allowed=0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: CPU: 1 PID: 7452 Comm: process-agent Not tainted 4.14.62-70.117.amzn2.x86_64 #1
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: Hardware name: Amazon EC2 t3.medium/, BIOS 1.0 10/16/2017
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: Call Trace:
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: dump_stack+0x5c/0x82
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: warn_alloc+0x114/0x1c0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_slowpath+0x831/0xe00
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: ? get_page_from_freelist+0x371/0xba0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_nodemask+0x227/0x250
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: filemap_fault+0x204/0x5f0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __xfs_filemap_fault.constprop.8+0x49/0x120 [xfs]
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __do_fault+0x20/0x60
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: handle_pte_fault+0x945/0xeb0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? __switch_to_asm+0x34/0x70
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? __switch_to_asm+0x40/0x70
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: __handle_mm_fault+0x431/0x540
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: handle_mm_fault+0xaa/0x1e0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: __do_page_fault+0x23e/0x4c0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? async_page_fault+0x2f/0x50
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: async_page_fault+0x45/0x50
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: RIP: 00f1:0x11
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: RSP: 0234:00007f768c4e5d38 EFLAGS: 00000000
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: Mem-Info:
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: active_anon:895956 inactive_anon:8314 isolated_anon:0
active_file:15 inactive_file:24 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:17237 slab_unreclaimable:26796
mapped:21743 shmem:28069 pagetables:7196 bounce:0
free:21560 free_pcp:682 free_cma:0

plugin logs

2018-10-31T10:50:15Z [INFO] Starting CNI Plugin v1.2.1  ...
2018-10-31T10:50:15Z [INFO] Received CNI del request: ContainerID(56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=monitoring;K8S_POD_NAME=datadog-datadog-bk5bd;K8S_POD_INFRA_CONTAINER_ID=56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302) Path(/opt/aws-cni/bin:/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2018-10-31T10:50:15Z [ERROR] Error received from DelNetwork grpc call for pod datadog-datadog-bk5bd namespace monitoring container 56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

ipamd.log

2018-10-31T10:05:43Z [DEBUG] Found ip addresses [10.0.1.72 10.0.1.208 10.0.1.36 10.0.1.86 10.0.1.219 10.0.1.63] on eni 02:af:21:3c:f9:4e
2018-10-31T10:05:44Z [DEBUG] Found eni mac address : 02:b3:1a:eb:c3:5e
2018-10-31T10:05:52Z [DEBUG] Using device number 0 for primary eni: eni-0f37efb5e4ebecf09
2018-10-31T10:05:52Z [DEBUG] Found eni: eni-0f37efb5e4ebecf09, mac 02:b3:1a:eb:c3:5e, device 0
2018-10-31T10:05:55Z [DEBUG] Found cidr 10.0.1.0/24 for eni 02:b3:1a:eb:c3:5e
2018-10-31T10:05:59Z [DEBUG] Found ip addresses [10.0.1.143 10.0.1.96 10.0.1.65 10.0.1.209 10.0.1.134 10.0.1.8] on eni 02:b3:1a:eb:c3:5e
2018-10-31T10:05:59Z [DEBUG] Reconcile existing ENI eni-0ce38d7ac411b07ab IP pool
2018-10-31T10:05:59Z [DEBUG] Reconcile and skip primary IP 10.0.1.117 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.53 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.53 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.102 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.102 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.120 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.120 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.42 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.42 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.59 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.59 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Reconcile existing ENI eni-0f1db76fd54b2e3f5 IP pool
2018-10-31T10:06:00Z [DEBUG] Reconcile and skip primary IP 10.0.1.72 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.208 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.208 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.36 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.36 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.86 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.86 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.219 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.219 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.63 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.63 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Reconcile existing ENI eni-0f37efb5e4ebecf09 IP pool
2018-10-31T10:06:00Z [DEBUG] Reconcile and skip primary IP 10.0.1.143 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.96 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.96 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.65 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.65 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.209 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.209 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.134 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.134 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.8 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.8 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Successfully Reconciled ENI/IP pool
2018-10-31T10:06:06Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:06Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:06Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:11Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:12Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:13Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:17Z [INFO]  Pods deleted on my node: t1
2018-10-31T10:06:18Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:19Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:19Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:25Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:25Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:26Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:33Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:34Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:35Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:35Z [DEBUG] Reconciling ENI/IP pool info...
2018-10-31T10:07:19Z [INFO]  Pods deleted on my node: t2
2018-10-31T10:09:29Z [INFO]  Pods deleted on my node: t3
2018-10-31T10:14:30Z [ERROR] Failed to retrieve interfaces data from instance metadata RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-10-31T10:14:37Z [ERROR] ip pool reconcile: Failed to get attached eni infoget attached enis: failed to retrieve interfaces data: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-10-31T10:14:47Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:14:49Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:14:51Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:14:52Z [DEBUG] Reconciling ENI/IP pool info...
2018-10-31T10:14:59Z [INFO]  Pods deleted on my node: t4
2018-10-31T10:15:43Z [DEBUG] Total number of interfaces found: 3 
2018-10-31T10:15:44Z [DEBUG] Found eni mac address : 02:63:a9:60:fc:42
rakeshpatri commented 5 years ago

I am using the EKS-optimized default AMI, as per the guide in the EKS documentation.

jaredeis commented 5 years ago

Are you using the latest one, or at least the one that was pushed in early January? It has a fix to tell kubelet to restart if it dies from SIGPIPE, which happens if journald gets killed by a lack of resources. If you run cat /etc/systemd/system/kubelet.service | grep "RestartForceExitStatus" on one of your worker nodes, you should see RestartForceExitStatus=SIGPIPE as output. If not, then you don't have the fix that restarts kubelet when journald dies.
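
For reference, that check and the expected output look like this on a node that has the fix (an equivalent grep run directly against the unit file):

# Run on the worker node: does the kubelet unit restart on SIGPIPE?
grep RestartForceExitStatus /etc/systemd/system/kubelet.service
# Expected output when the fix is present:
RestartForceExitStatus=SIGPIPE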

Manjukb commented 5 years ago

@jaredeis, on behalf of @rakeshpatri: we tested it, and I can see RestartForceExitStatus=SIGPIPE in the output.

What more can we try?

jaredeis commented 5 years ago

I guess the only other thing I could suggest is opening a case with AWS to see if they can help determine why kubelet is dying on your nodes.

tckb commented 5 years ago

I have been seeing the same issue. We are on t2.medium and one node goes "NotReady" sporadically. The only solution I have found is to detach the node and spawn another one; a restart does not work, sigh. Any interim solution for this? @agcooke / others, is your fork available somewhere? Clearly AWS has no time for this one! 😠

agcooke commented 5 years ago

Most of the important changes I made are now merged into the latest AMI.

I would make sure that you are using EBS-optimized EC2 instances, and upgrade your Docker version by changing the version in the Packer file.

My fork is here: https://github.com/agcooke/amazon-eks-ami. I have moved on to another project, so I have not kept it up to date with the AWS image.
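
For illustration, a rough sketch of that Docker upgrade when building a custom AMI; docker_version is an assumed variable name here, so check the Packer template in this repo for the variable it actually exposes:

# Hypothetical: build a worker AMI with a newer, pinned Docker version
# (docker_version and the template file name should be verified against the repo)
packer build -var 'docker_version=18.06' eks-worker-al2.json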

rakeshpatri commented 5 years ago

I have been seeing the same issue. We are on t2.medium and one node goes "NotReady" sporadically. The only solution I have found is to detach the node and spawn another one; a restart does not work, sigh. Any interim solution for this? @agcooke / others, is your fork available somewhere? Clearly AWS has no time for this one! 😠

We have set up Cluster Autoscaler and have set resource limits on all the deployments in Kubernetes. Also use an HPA for any deployment that consumes more resources. After applying these changes we have not faced this issue anymore.
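
For illustration only, a minimal deployment with explicit requests and limits; every name, image, and value below is made up:

# Hypothetical deployment with explicit resource requests and limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: example/app:latest   # placeholder image
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi

An HPA for the same deployment could then be created with, for example, kubectl autoscale deployment example-app --cpu-percent=80 --min=2 --max=5.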

tckb commented 5 years ago

@agcooke were the changes released? We are currently running amazon-eks-node-1.11-v20190220 and unfortunately the issue remains.

whereisaaron commented 5 years ago

@tckb I've seen it suggested that some of these changes will be in the 1.11.8 AMIs, which haven't yet been released but should be quite soon.

benjamin658 commented 5 years ago

Our EKS cluster (1.11) with AMI ami-0f54a2f7d2e9c88b3 is facing the same issue randomly, and it kills my production services many times per day.

I was wondering whether upgrading the EKS cluster to 1.12 and using the latest AMI (ami-0923e4b35a30a5f53) would solve this problem (following these steps: https://docs.aws.amazon.com/eks/latest/userguide/update-stack.html).

pod2metra commented 5 years ago

Same issue on Server Version v1.12.6-eks-d69f1b and AMI ami-0abcb9f9190e867ab.

avielb commented 5 years ago

Same issue here, running EKS 1.12 and the latest AMIs in us-east-1.

benjamin658 commented 5 years ago

It seems to be caused by an out-of-memory condition on the kubelet host. After adding the BootstrapArguments below to the CloudFormation template, the NotReady state is no longer happening.

[screenshot: 2019-04-29 15:39]

Here is my BootstrapArguments:

--kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%"
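
For context, on a self-managed node group these arguments are typically passed through to the EKS bootstrap script from the instance user data; a minimal sketch, where my-cluster is a placeholder cluster name:

#!/bin/bash
# Sketch: pass kubelet reservations and eviction thresholds via the bootstrap script
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%"
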
tckb commented 5 years ago

@benjamin658 / others, can you confirm this? I did not see any such errors in the logs.

benjamin658 commented 5 years ago

@benjamin658 / others, can you confirm this? I did not see any such errors in the logs.

I'm not 100 percent sure, but after I added the BootstrapArguments, our cluster has been working well.

dijeesh commented 5 years ago

Having the same issue.

EKS v1.12.6-eks-d69f1b AMI ami-0abcb9f9190e867ab

tckb commented 5 years ago

@dijeesh did you try the suggestions from @benjamin658?

AmazingTurtle commented 5 years ago

I'm experiencing the same issues. The problems started when I installed gitlab-runner using Helm and spawned ~20 jobs in an hour or so.

Nodes running v1.12.7 on AMI ami-0d741ed58ca5b342e

I have weavescope installed in my cluster, and when looking at Hosts/Resources I see many containers named svc-0 (which come from GitLab). They are Docker containers that should have been deleted, and somehow they are: when I search for them on the nodes using the docker CLI they are gone, and Kubernetes provides no further information either. That might be a weavescope bug, but if not, this might be a hint about the node NotReady issues.

Edit: I ran into a CNI issue as well (network addresses per host exhausted, reaching the pod limit, aka "insufficient pods"); see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI. In my particular case I was using t3.small instances (3 interfaces x 4 IPs each = 12 addresses, of which 11 are assignable). This might also be a cause for a node status changing to NotReady.
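
For reference, the default pod capacity EKS derives from those ENI limits works out as follows; the t3.medium figure matches the "pods: 17" capacity shown in the node description earlier in this issue:

# max pods = (number of ENIs) x (IPv4 addresses per ENI - 1) + 2
# t3.small:  3 x (4 - 1) + 2 = 11 pods
# t3.medium: 3 x (6 - 1) + 2 = 17 pods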

whereisaaron commented 5 years ago

I thought reserving resources for the kubelet was default/built-in behavior in current k8s, but it sounds like it is optional and EKS doesn't do it 😢

Reserving resources for the kubelet is extremely important when you run overcommitted workloads (collections of spiky workloads), i.e. any time resource Limits >= Requests or you don't specify resource limits. Under node resource exhaustion you want some workloads to be rescheduled, not entire nodes to go down.

If you are using small nodes, failures like this will be more common. Plus you have the low EKS pod limit caused by ENI limitations. I'd suggest reserving some system resources on each node, and using fewer, larger nodes.

https://github.com/weaveworks/eksctl/issues/795
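
If the nodes are managed with eksctl (the issue linked above), the same reservations can be expressed in the nodegroup config; a sketch assuming eksctl's kubeletExtraConfig block, with illustrative names and values:

# Hypothetical eksctl config reserving resources for the kubelet and the system
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster      # placeholder
  region: eu-west-1
nodeGroups:
  - name: workers
    instanceType: m5.large
    desiredCapacity: 3
    kubeletExtraConfig:
      kubeReserved:
        memory: 300Mi
        ephemeral-storage: 1Gi
      systemReserved:
        memory: 200Mi
        ephemeral-storage: 1Gi
      evictionHard:
        memory.available: 200Mi
        nodefs.available: "10%"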

bobbui commented 5 years ago

This still happens on EKS 1.13. It started to happen when the cluster was running under really high load.

montanaflynn commented 5 years ago

Happening to me as well. Looking at kubectl get node <name> -o=yaml I see taints:

spec:
  providerID: aws:///us-east-1a/i-07b8613b0ed988d73
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: 2019-07-30T07:11:15Z
montanaflynn commented 5 years ago

I think this might be related? https://github.com/weaveworks/eksctl/issues/795

JamesDowning commented 4 years ago

We are seeing similar behaviour; it appears to be almost random, or possibly coincides with a deployment. A node or two will suddenly appear NotReady, and resource graphs indicate utilisation is hardly over 50%, so OOM shouldn't be an issue.

As mentioned by @AmazingTurtle, we are also on 4-5 t3.small nodes with around 50 pods, so we may be seeing the effects of exhausted network addresses, despite not seeing those logs.

In line with @montanaflynn, the node suddenly has the following taints applied:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                   Message
  ----             ------    -----------------                 ------------------                ------                   -------
  MemoryPressure   Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  DiskPressure     Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  PIDPressure      Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  Ready            Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  OutOfDisk        Unknown   Fri, 29 Nov 2019 14:03:02 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusNeverUpdated   Kubelet never posted node status.
Kubernetes version: 1.13
Platform version: eks.6

I'm going to try increasing node size and adding some resource limits to deployments that may not have them correctly configured.

AmazingTurtle commented 4 years ago

@JamesDowning Unless you're considering running production deployments on Azure, you should go with any other service. Even self-hosted would be better than AWS EKS, IMHO. It's just expensive as hell. Have you had a look at Fargate? Using the CDK you can bootstrap apps easily without worrying about infrastructure. It just runs. I gave it a shot a couple of days ago and it's just sexy and works the way I want it to, not like EKS.
jontro commented 4 years ago

I'm getting this on a t3.small node:

spec:
  providerID: aws:///eu-west-1b/i-<redacted>
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2020-05-20T11:36:58Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2020-05-20T11:37:03Z"

What is adding these taints and will they ever get removed?

brsolomon-deloitte commented 1 year ago

Seeing this on Amazon EKS 1.26.

seyal84 commented 1 year ago

What was the resolution to this issue? It still persists on EKS v1.23.

dimittal commented 1 year ago

We are also facing this issue on a daily basis. Any resolution for this?

cartermckinnon commented 1 year ago

@dimittal this can happen for many reasons, please open a new issue with details of your environment and symptoms.