Any ideas on this?
All those weird log messages about memory in your journalctl look a lot like OOM messages (but don't specifically mention the OOM killer being started). And the Call Trace line looks like a kernel bug.
What does free -m say on these nodes?
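Something like the following on an affected node should confirm or rule out the OOM killer (just a rough sketch):
```
free -m                                    # overall memory and swap usage
sudo dmesg -T | grep -iE 'out of memory|oom|killed process' | tail -n 20
sudo journalctl -k --since "2 days ago" | grep -iE 'out of memory|oom' | tail -n 20
```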
The issue is happening again; here are the logs from around the time the node status became NotReady.
kubectl describe node
kubectl describe node ip-<secret>.eu-west-1.compute.internal
Name: ip-<secret>.eu-west-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-west-1
failure-domain.beta.kubernetes.io/zone=eu-west-1b
kubernetes.io/hostname=ip-<secret>.eu-west-1.compute.internal
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp: Tue, 30 Oct 2018 11:09:42 +0100
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk Unknown Fri, 09 Nov 2018 13:15:13 +0100 Fri, 09 Nov 2018 13:15:56 +0100 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Fri, 09 Nov 2018 13:15:13 +0100 Fri, 09 Nov 2018 13:15:56 +0100 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Fri, 09 Nov 2018 13:15:13 +0100 Fri, 09 Nov 2018 13:15:56 +0100 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Fri, 09 Nov 2018 13:15:13 +0100 Tue, 30 Oct 2018 11:09:41 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown Fri, 09 Nov 2018 13:15:13 +0100 Fri, 09 Nov 2018 13:15:56 +0100 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: <secret>
Hostname: ip-<secret>.eu-west-1.compute.internal
Capacity:
cpu: 2
ephemeral-storage: 67096556Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3980344Ki
pods: 17
Allocatable:
cpu: 2
ephemeral-storage: 61836185908
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3877944Ki
pods: 17
System Info:
Machine ID: ec2d77ff3bd31c1a42446de3e9ddd79e
System UUID: EC2D77FF-3BD3-1C1A-4244-6DE3E9DDD79E
Boot ID: da26a6a5-3a46-4405-8511-7bad8ecbbf3f
Kernel Version: 4.14.62-70.117.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.6.2
Kubelet Version: v1.10.3
Kube-Proxy Version: v1.10.3
ExternalID: i-<secret>
ProviderID: aws:///eu-west-1b/i-<secret>
Non-terminated Pods: (11 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system aws-node-2bp7x 10m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-k4c68 100m (5%) 0 (0%) 0 (0%) 0 (0%)
monitoring datadog-datadog-474bp 200m (10%) 200m (10%) 256Mi (6%) 256Mi (6%)
monitoring prometheus-node-exporter-8n285 0 (0%) 0 (0%) 0 (0%) 0 (0%)
monitoring prometheus-server-5997957687-wlc4w 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t-1 t-1-5559cc964d-w9hlt 250m (12%) 250m (12%) 700Mi (18%) 700Mi (18%)
t-2 t-2-697fdf56d8-6xrzx 250m (12%) 250m (12%) 700Mi (18%) 700Mi (18%)
t-3 t-3-697896bf47-7fwmh 250m (12%) 250m (12%) 700Mi (18%) 700Mi (18%)
t-4 t-4-65c7d884b-66b46 250m (12%) 250m (12%) 700Mi (18%) 700Mi (18%)
t-5 t-5-66f49688cd-nktgl 0 (0%) 0 (0%) 0 (0%) 0 (0%)
tr-6 tr-6-7bf488684b-7ct4r 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
1310m (65%) 1200m (60%) 3056Mi (80%) 3056Mi (80%)
Events: <none>
Journalctl
Oct 30 13:19:02 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:19:10 ip-10-0-1-121.eu-west-1.compute.internal dhclient[2676]: XMT: Solicit on eth0, interval 108540ms.
Oct 30 13:19:16 ip-10-0-1-121.eu-west-1.compute.internal kubelet[3629]: E1030 13:19:16.255590 3629 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-10-0-1-121.eu-west-1.compute.internal": Unauthorized
Oct 30 13:19:17 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 25993 (agent)
Oct 30 13:19:17 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:19:17 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 25993 (agent)
Oct 30 13:20:47 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 28579 (agent)
Oct 30 13:20:47 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:20:47 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 28579 (agent)
Oct 30 13:20:47 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:20:59 ip-10-0-1-121.eu-west-1.compute.internal dhclient[2676]: XMT: Solicit on eth0, interval 130410ms.
Oct 30 13:21:02 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 29006 (agent)
Oct 30 13:21:02 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:21:02 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 29006 (agent)
ipamd.log
2018-11-09T12:35:08Z [DEBUG] IP Address Pool stats: total: 15, assigned: 1
2018-11-09T12:35:08Z [DEBUG] Reconciled IP 10.0.1.94 on eni eni-0c6011f1b1a05f13b
2018-11-09T12:35:08Z [DEBUG] Successfully Reconciled ENI/IP pool
2018-11-09T12:35:11Z [INFO] Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-cypress-2115-xmmkr
2018-11-09T12:35:12Z [INFO] Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-mobile-2115-75jr5
2018-11-09T12:35:12Z [INFO] Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-nose-functional-fast-2115-wv5gt
2018-11-09T12:35:13Z [INFO] Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-nose-functional-slow-2115-rztn4
2018-11-09T12:35:13Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:13Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:13Z [DEBUG] No ENI can be deleted at this time
2018-11-09T12:35:13Z [ERROR] Failed to decrease pool free ENI: no ENI can be deleted at this time
2018-11-09T12:35:13Z [INFO] Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-nose-unit-2115-npbdg
2018-11-09T12:35:14Z [INFO] Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-pytest-2115-2wjj9
2018-11-09T12:35:15Z [INFO] Add/Update for Pod prometheus-server-5997957687-wlc4w on my node, namespace = monitoring, IP = 10.0.1.136
2018-11-09T12:35:15Z [INFO] Pods deleted on my node: monitoring/prometheus-server-5997957687-wlc4w
2018-11-09T12:35:18Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:18Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:18Z [DEBUG] No ENI can be deleted at this time
2018-11-09T12:35:18Z [ERROR] Failed to decrease pool free ENI: no ENI can be deleted at this time
2018-11-09T12:35:23Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:23Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:23Z [DEBUG] FreeENI: found a deletable ENI eni-0abc9a4f1747dbc0e
2018-11-09T12:35:23Z [INFO] FreeENI eni-0abc9a4f1747dbc0e: IP address pool stats: free 5 addresses, total: 10, assigned: 1
2018-11-09T12:35:23Z [DEBUG] Start freeing eni eni-0abc9a4f1747dbc0e
2018-11-09T12:35:23Z [INFO] Trying to free eni: eni-0abc9a4f1747dbc0e
2018-11-09T12:35:23Z [DEBUG] Found eni eni-0abc9a4f1747dbc0e attachment id: eni-attach-0e57c7548fe9c7da3
2018-11-09T12:35:23Z [DEBUG] Trying to delete eni: eni-0abc9a4f1747dbc0e
2018-11-09T12:35:24Z [DEBUG] Not able to delete eni yet (attempt 1/20): InvalidParameterValue: Network interface 'eni-0abc9a4f1747dbc0e' is currently in use.
status code: 400, request id: 843583a7-3416-4d7b-b5fd-1393421a5f04
2018-11-09T12:35:29Z [DEBUG] Not able to delete eni yet (attempt 2/20): InvalidParameterValue: Network interface 'eni-0abc9a4f1747dbc0e' is currently in use.
status code: 400, request id: 1501e10c-2026-4377-8d66-14630c3da533
2018-11-09T12:35:34Z [DEBUG] Not able to delete eni yet (attempt 3/20): InvalidParameterValue: Network interface 'eni-0abc9a4f1747dbc0e' is currently in use.
status code: 400, request id: 009a38d0-2824-4bb2-b707-6a31cece470f
2018-11-09T12:35:39Z [INFO] Successfully deleted eni: eni-0abc9a4f1747dbc0e
2018-11-09T12:35:39Z [INFO] Successfully freed eni: eni-0abc9a4f1747dbc0e
2018-11-09T12:35:39Z [DEBUG] Successfully decreased IP Pool
2018-11-09T12:35:39Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:44Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:44Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:49Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:49Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
So we are getting no help from AWS, either via support or here.
Issues:
Thus the AMI is not ready for production use! We will have to fork it, like the 90 other people who are doing this.
Probably use Lumoslabs or AdvMicrogrid as a base.
Very poor support from AWS :( They told us that the image is set up for default workloads. We are not doing anything special... and we lose two nodes a day.
@agcooke thanks for this report, and sorry for the poor support :( We'll take a look into this
@agcooke Can you email me your Support case ID and Account ID? I'll see what I can track down on my end. mhausler@amazon.com
Thanks @micahhausler , I will reach out to you in email and when we figure out the solution let people know.
One thing I did not mention above is that we are using 64GB GP2 EBS volumes on the node.
It's very hand-wavy, but it seems that the node that runs Prometheus (which uses a GP2 PVC) flaps to NotReady and back every few days. We will investigate using provisioned IOPS volumes for the instance root volumes and PVCs to see if that helps resolve the issue. LumoLabs also use an ext4 filesystem and, I think, a separate volume for Docker, which could help with the IO issues.
But I do not believe that IO is the root cause... We should not be seeing kernel level stack traces.
I have seen this on 2 clusters in the last few weeks
We were seeing this frequently. Our current theory is it's related to running out of space on the root filesystem which caused the kubelet to die silently.
@gflarity Great, we will see if that is the cause. We are now running the v25 AMI, which should rotate logs better, so the disks should not fill up as much.
We were having these issues on t2.large and t3.medium nodes.
We had a talk with SAs from AWS and they suggested that we try using c5.large or m4.large nodes.
Summary of the issues we are seeing:
Will keep you up to date with the progress.
@joberdick If you can add some logs to help triage the root cause it would be great!
@agcooke I don't think the rotation fix is in v25; it was just merged two days ago. I might be wrong, but when I looked at the dates yesterday, v25 was put out about 8 days ago.
When you log into the nodes, is the kubelet silently crashing (i.e. disappearing)? We had this happen a couple of days ago and the root fs was full, so we think the kubelet tried to log before it crashed and couldn't. It was pretty clear from the sys logs that the FS was full.
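For anyone else hitting this, roughly what we checked on the node (a sketch; paths assume the stock EKS AMI with Docker as the runtime):
```
df -h /                                        # is the root filesystem full?
systemctl status kubelet                       # is the kubelet even running?
sudo journalctl -u kubelet -n 100 --no-pager   # the last thing it logged before dying
sudo tail -n 50 /var/log/messages              # FS-full and allocation errors show up here
```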
What does kubectl get events say? I think you should see a DiskPressure event before the node dies, if disk space is the problem.
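Something like this should surface them without needing to be on the host (a sketch; <node-name> is a placeholder):
```
# Node-related warning events (DiskPressure, ImageGCFailed, NodeNotReady, ...)
kubectl get events --all-namespaces | grep -iE 'diskpressure|imagegc|freedisk|notready'
# And the node conditions themselves
kubectl describe node <node-name> | grep -A 10 'Conditions:'
```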
@max-rocket-internet (we believe lack of free space is the root cause of our previous NotReady issues too, but for this particular one we're fairly certain of it)
2018-11-16 08:38:04,403245 INFO [1] [region_selectable.go:16] targeting region us-west-2
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
15m 1h 14 <redacted> Node Warning FreeDiskSpaceFailed kubelet, ip-10-<redacted> (combined from similar events): failed to garbage collect required amount of images. Wanted to free 3592427929 bytes, but freed 0 bytes
17s 1h 16 <redacted> Node Warning ImageGCFailed kubelet, <redacted> (combined from similar events): failed to garbage collect required amount of images. Wanted to free 3865643417 bytes, but freed 0 bytes
The pattern we see is that there was no logging indicating why the kubelet dies... not having space to log is a good explanation for that, and probably the root cause. Unfortunately our disk usage tracking wasn't set up properly, so we can't go back and confirm this.
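When we see those FreeDiskSpaceFailed / ImageGCFailed events now, this is roughly what we run on the node (a sketch; assumes Docker is the runtime):
```
df -h /                        # confirm the root filesystem really is full
sudo docker system df          # images vs containers vs volumes
sudo docker image prune -a -f  # reclaim unused images by hand when the kubelet's GC cannot
sudo journalctl -u kubelet --since "2 hours ago" | grep -iE 'imagegc|disk'
```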
@gflarity
I don't get any events when doing kubectl get events
kubectl describe node
ip-192-168-145-119.ec2.internal
Name: ip-192-168-145-119.ec2.internal
Roles:
OutOfDisk Unknown Tue, 13 Nov 2018 17:26:36 -0500 Tue, 13 Nov 2018 17:27:20 -0500 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Tue, 13 Nov 2018 17:26:36 -0500 Tue, 13 Nov 2018 17:27:20 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Tue, 13 Nov 2018 17:26:36 -0500 Tue, 13 Nov 2018 17:27:20 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Tue, 13 Nov 2018 17:26:36 -0500 Fri, 02 Nov 2018 22:39:57 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown Tue, 13 Nov 2018 17:26:36 -0500 Tue, 13 Nov 2018 17:27:20 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 192.168.145.119
Hostname: ip-192-168-145-119.ec2.internal
Capacity:
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-2Mi: 0
memory: 4040024Ki
pods: 17
Allocatable:
cpu: 2
ephemeral-storage: 19316009748
hugepages-2Mi: 0
memory: 3937624Ki
pods: 17
System Info:
Machine ID: c1c0616b96eb44c3b7b0d094dcd9536e
System UUID: EC2CEF1E-2CFC-16F6-F994-9C24FC7754E2
Boot ID: 38401493-f278-4779-8186-f09cf3b39f5d
Kernel Version: 4.14.62-70.117.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.6.2
Kubelet Version: v1.10.3
Kube-Proxy Version: v1.10.3
ExternalID: i-0857bdc11bf82764b
ProviderID: aws:///us-east-1b/i-0857bdc11bf82764b
Non-terminated Pods: (12 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
default sleep-8677f78569-6gvt5 0 (0%) 0 (0%) 0 (0%) 0 (0%)
istio-system istio-citadel-557f97f46d-dqbbd 10m (0%) 0 (0%) 0 (0%) 0 (0%)
istio-system istio-sidecar-injector-8d65456b6-6bqn7 10m (0%) 0 (0%) 0 (0%) 0 (0%)
istio-system istio-telemetry-58cf9fbbfb-prwgd 20m (1%) 0 (0%) 0 (0%) 0 (0%)
istio-system prometheus-7fcd5846db-89lph 10m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system aws-node-rkm2l 10m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-dns-64b69465b4-9ldd7 260m (13%) 0 (0%) 110Mi (2%) 170Mi (4%)
kube-system kube-proxy-jtlqm 100m (5%) 0 (0%) 0 (0%) 0 (0%)
kube-system tiller-deploy-778f674bf5-r7c4b 0 (0%) 0 (0%) 0 (0%) 0 (0%)
teama leads-api-75cdc7b454-cjwmt 110m (5%) 1 (50%) 10Mi (0%) 10Mi (0%)
teama loans-ui-847df86c67-m27s6 110m (5%) 1 (50%) 10Mi (0%) 10Mi (0%)
teamb sleep-8677f78569-j982c 10m (0%) 10m (0%) 10Mi (0%) 10Mi (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
650m (32%) 2010m (100%) 140Mi (3%) 200Mi (5%)
Events:
Unfortunately, this environment does not have a key pair set up for EC2, so the info I will be able to gather will be limited.
Any suggestions on what else to check without being on the host? I am going to try autoscaling after adding a key pair, but was avoiding touching anything until I hear from AWS support.
@joberdick
/shrug there might be multiple causes of this...
Can you ssh into the node? That's how I first diagnosed our issue.
We have been running m4.large and are not seeing the issues anymore.
I think the symptoms are hidden by faster network throughput.
Running m4.xlarge and now nodes are dying like crazy... before it was one a week, now it's 3 today already. Of course there's no possibility to SSH, and the Amazon status checks say the node is OK (sic!).
@gacopl We are seeing similar behaviour on m4.xlarge as well. Our guess is that https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/ might help, but this is not available on Kubernetes 1.10.3 on AWS. Waiting patiently for Kubernetes 1.11 and 1.12 on AWS EKS :/
Keep waiting. We started around September and we are still on EKS platform version 1; they did not even upgrade us to v2. Choosing EKS was the worst decision. The worst part is that pods do not get rescheduled when a node is NotReady, so we will have to set up special health checks for the ASG...
@gacopl is it helpful to add additional healthchecks even if the node is still unstable?
For the last two days we have also been seeing the m4.large nodes dying all the time...
We are resorting to using another AMI, but do not have results for that yet.
We had relative stability and now things are bad again.
@radityasurya what else do you advise, since only terminating the node gets pods rescheduled, and the basic ASG checks do not detect that the node is unreachable?
@agcooke which AMI specifically?
We are using amazon-eks-node-v25 (ami-00c3b2d35bddd4f5c) and I have started working on using https://github.com/lumoslabs/amazon-eks-ami AMI.
If anyone knows how to use a CoreOS AMI or an Ubuntu (conjure-up) AMI it would be great...
@agcooke So v25 does not solve it, right? I was about to upgrade from v23.
No, just found https://cloud-images.ubuntu.com/docs/aws/eks/
We are working on an AMI release first week of December with the log rotation patch in. That should fix the issues related to disk filling up due to logs.
@eswarbala It's not only the logs; the fd limits are also too low on bigger instances.
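To check what the nodes are actually running with, something like this works (rough sketch):
```
cat /proc/sys/fs/file-max                         # kernel-wide maximum
cat /proc/sys/fs/file-nr                          # currently allocated file handles
sudo prlimit -p "$(pidof dockerd)" | grep NOFILE  # dockerd's per-process limit
sudo prlimit -p "$(pidof kubelet)" | grep NOFILE  # kubelet's per-process limit
```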
@agcooke
We have an Ubuntu 18.04 AMI forked from this one. Been using it for a prototype cluster, and haven't noticed this issue.
I have some launch user-data stuff I need to keep :/ so Ubuntu is not for me.
@kinghajj Yes I saw your work already. I tried the lumolabs Amazon Linux one but we saw nodes crashing.
We moved our entire cluster over to your AMI today. Will let you know how it goes. I will see if I can port some of the nice ulimit changes from the LumoLabs AMI to your AMI.
The ubuntu provided AMI expects you to have used CloudFormation to create your cluster. So at this time it seems too much to get that working.
If we do not make headway we will recreate the cluster using eksctl. It supports the Ubuntu image now as well and also uses CloudFormation.
@agcooke would it not be better to debug the issue further instead of switching between all the AMIs?
For the moment, setting the eviction-hard nodefs threshold and the kube-reserved and system-reserved memory flags seems to help; we'll see.
@max-rocket-internet, yes you are correct, we should solve the root cause.
I am looking at other people's AMIs so that I can build on top of their work instead of doing the same work twice. Lumoslabs and AdvMicrogrid did good work.
The Ubuntu Bionic AMI from AdvMicrogrid seems to be working well. I also SSH'd into the node to check the ulimits, and they seem very permissive:
ubuntu@ip-10-0-0-109:~$ sudo prlimit -p 24954
sudo: unable to resolve host ip-10-0-0-109
RESOURCE DESCRIPTION SOFT HARD UNITS
AS address space limit unlimited unlimited bytes
CORE max core file size unlimited unlimited bytes
CPU CPU time unlimited unlimited seconds
DATA max data size unlimited unlimited bytes
FSIZE max file size unlimited unlimited bytes
LOCKS max number of file locks held unlimited unlimited locks
MEMLOCK max locked-in-memory address space 16777216 16777216 bytes
MSGQUEUE max bytes in POSIX mqueues 819200 819200 bytes
NICE max nice prio allowed to raise 0 0
NOFILE max number of open files 1048576 1048576 files
NPROC max number of processes unlimited unlimited processes
RSS max resident set size unlimited unlimited bytes
RTPRIO max real-time priority 0 0
RTTIME timeout for real-time tasks unlimited unlimited microsecs
SIGPENDING max number of pending signals 31808 31808 signals
STACK max stack size 8388608 unlimited bytes
LumoLabs added this to their dockerd.json file:
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Soft": 2048,
"Hard": 8192
}
}
Does anyone have the operational experience to know what a sane value is? I think the AWS EKS AMI has a nofile limit of 1000, but I am not sure; that has been too little for us.
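One way to see what containers actually end up with, whatever is in daemon.json, is to check from inside a throwaway container (rough sketch; assumes Docker is the runtime):
```
sudo docker run --rm busybox sh -c 'ulimit -n'   # effective nofile limit inside a container
cat /etc/docker/daemon.json                      # what dockerd itself was configured with
```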
@gacopl LumoLabs also set the following in their kubelet-configuration.yaml file:
kubeReserved:
  cpu: "1000m"
  memory: "2Gi"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "5%"
evictionSoft:
  memory.available: "300Mi"
  nodefs.available: "10%"
evictionSoftGracePeriod:
  memory.available: "60s"
  nodefs.available: "60s"
Is that similar to what you did?
I should probably open another issue for the AMI, but when we started using some default alerts with the AWS AMI v25, this alert fired frequently:
ALERT ClockSyncBroken
IF node_timex_sync_status != 1
FOR 5m
LABELS { severity="warning" }
ANNOTATIONS {
summary = "The clock is not being synced.",
impact = "Random things are about to break for our users",
detail = "Node: {{$labels.node}}",
}
I saw that @albertdixon added NTP settings in their fork. I guess that is to solve the ClockSyncBroken issue. It should probably be synced back into the upstream AMI.
The Ubuntu based AMI does not have that problem.
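For anyone checking their own nodes, the sync status is easy to verify (a sketch; which command applies depends on whether ntpd or chrony is installed):
```
timedatectl status | grep -i synchronized   # works regardless of the NTP client
chronyc tracking                            # if chrony is installed
ntpstat                                     # if ntpd is installed
```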
@agcooke yes, but I reserved a little less, based on a calculator at k8s.io: I gave the system 500M and the kubelet 200M. The good thing was I was able to set those with --kubelet-extra-args in the user-data without repacking the AMI.
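For reference, that looks roughly like this in the instance user-data (a sketch; my-cluster is a placeholder and the reserved values are just the ones mentioned above, not a recommendation):
```
#!/bin/bash
# Pass reserved-resource and eviction flags to the kubelet via the stock AMI's
# bootstrap script, so no AMI rebuild is needed.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--system-reserved=memory=500Mi --kube-reserved=memory=200Mi --eviction-hard=memory.available<100Mi,nodefs.available<10%'
```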
What are your feelings regarding the Ubuntu AMI? Is it worth changing to, considering it has different bootstrapping etc.?
@gacopl At the moment I am more positive. The changes that AdvMicrogrid made to the AMI are minimal, and the base image seems to be set up to work better by default (newer Docker, NTP and ulimits).
The biggest negative is that it is based on Ubuntu, and I am pretty sure the amazon-eks-ami will never change to that as a base image. Thus we will have to make sure it stays up to date with the upstream.
Using Packer to build is very easy. See my branch: I just changed the Makefile to use docker-compose to run Packer, like LumoLabs did.
Does anyone have the operational experience to know what a sane value is? I think the AWS EKS AMI has a nofile limit of 1000, but I am not sure; that has been too little for us.
For things like this and many questions of similar fashion, IMO it's best to just work out what GKE does and copy them if possible. Their k8s-as-a-service is 2 years ahead of everyone else 🤭
Since the last update, our Ubuntu-based nodes have worked properly.
We upgraded to 1.11.5 and the nodes do not work well. We are also looking at using the latest default AWS AMI (built ourselves) again, and it looks OK on one of our less used clusters. Will give feedback on how it works on our heavily used cluster when we move there.
Also added https://github.com/awslabs/amazon-eks-ami/pull/130 to use chrony instead of ntp to sync the node times properly.
Added some more pull requests to fix other problems we have encountered:
My tentative conclusion is that it is now solved. We will know within a few weeks.
The way I attacked the problem was just to try a lot of things, because there was a lot wrong and I have little knowledge of running Kubernetes nodes and EC2/VPC details. Here is a summary of the steps taken:
So there were a lot of small issues making our nodes really broken. We could carry on triaging the problem to determine the real root cause from the changes above, but I think all the changes are good and useful, so we might as well keep them all.
We will watch how the Nodes perform for a week or two, and then close the ticket.
Our complete AMI, which is forked from the AWS AMI is found here: https://github.com/agcooke/amazon-eks-ami .
EBS-optimized has been true by default in the AWS console for most of the new instance types for some time, but I understand it was a Terraform problem.
"Increased the Subnet sizes in the VPC where our Cluster runs. We had issues where pods could not start because they could not get an IP address, this should fix that." This will not change a thing we had big nets for workers and pods and were still getting failed to assign ip to pod (which corrects itself after sometime) but is PITA this is a CNI problem on aws
Anyways thanks for those PRs
So to add to this, we have also had issues with nodes becoming NotReady. This happens because kubelet gets killed and doesn't restart. It's killed because the CPU gets saturated and we see several "page allocation stall" messages in /var/log/messages. Per k8s docs and AWS recommendation, we set some flags ('--kube-reserved=cpu=500m,memory=500Mi,ephemeral-storage=1Gi --system-reserved=cpu=500m,memory=500Mi,ephemeral-storage=1Gi', we use m5.xlarge right now). This helped a good bit, but I have still seen a couple of nodes die for the same reason as before. I have also set some hard eviction thresholds on our latest cluster as well.
But the ask here is for the kubelet to have more resiliency when it fails. If it fails, in many cases it's fine because the pods get rescheduled, but pods that are running on that node stay running. In the case of Prometheus, or the way we have it set up with an EFS mount so we don't lose data when the pod dies and moves to another node that might be in a different region, it won't start on the new node because it's still running on the old node with the dead kubelet. If the kubelet would restart, the API server would see there are 2 pods running, kill one, and things would stabilize.
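In the meantime, a systemd drop-in is enough to get automatic kubelet restarts (a sketch; assumes the unit is named kubelet.service):
```
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/10-restart.conf
[Service]
Restart=always
RestartSec=5
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```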
The best solution would be to have some kind of health check capability where, if the ASG sees a node in a NotReady state for a period of time, it kills it.
EDIT: It looks like my concern is addressed here: https://github.com/awslabs/amazon-eks-ami/pull/137
Well, I did a very clumsy script that polled for NotReady nodes and just terminated them through AWS CLI requests, to survive the crazy times until we figured out how to apply those reserved settings in EKS. Anyway, AFAIK when the kubelet gets into a NotReady state, Kubernetes will not evict the running pods no matter what; because the node's API is not working, it cannot recognize what's going on with the node. Once we put sane values on the reserved resources, we now have only very occasional hangs, like once per 2 weeks or so. But before it was crazy, you couldn't even log in to the instance, so I doubt simple kubelet restarts would have helped then.
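The script was roughly along these lines (a sketch of the idea, not the exact thing we ran; assumes kubectl is configured for the cluster and the AWS CLI has EC2 permissions):
```
#!/usr/bin/env bash
set -euo pipefail

# Find nodes whose status is anything other than plain "Ready".
not_ready=$(kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1}')

for node in $not_ready; do
  # providerID looks like aws:///<az>/<instance-id>; take the last path segment.
  instance_id=$(kubectl get node "$node" -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
  echo "Terminating NotReady node $node ($instance_id)"
  aws ec2 terminate-instances --instance-ids "$instance_id"
done
```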
We apply the settings through the bootstrap script flags. When a node is NotReady, you are right that it won't evict the pods, but it will reschedule them onto another node. In that case, hopefully the kubelet stays up, the node doesn't go NotReady, and the other mechanisms either kill the pods that are crushing the node or let them settle down. But the kubelet just dying with no restart isn't good, in my opinion and limited experience.
Yes, now we do it the same way. However, before, when nodes got stuck, the pods got evicted but they weren't rescheduled onto other nodes; not sure what the reason was, possibly stuck PVs.
Any solution as of now? We are facing the same issue even after using "--kube-reserved=cpu=500m,memory=500Mi,ephemeral-storage=1Gi --system-reserved=cpu=500m,memory=500Mi,ephemeral-storage=1Gi" ..
@rakeshpatri what AMI are you using?
We are running EKS in Ireland and our nodes are going unhealthy regularly.
It is not possible to SSH to the host, pods are not reachable. We have experienced this with t2.xlarge, t2.small and t3.medium instances.
We could SSH to another node in the cluster and ping the NotReady node, but we are not able to SSH to it either.
Graphs show that memory usage goes high at about the same time as the journalctl logs below. The EBS IO also goes high. The exact time is hard to pinpoint. I added logs with interesting 'failures' around the time that we think the node disappeared.
We are using the cluster for running tests, so pods are getting created and destroyed often.
We have not done anything described in https://github.com/awslabs/amazon-eks-ami/issues/51 for log rotation.
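(For reference, the usual Docker-side mitigation is json-file log rotation; a minimal sketch of what that could look like, not necessarily what #51 recommends. Note it overwrites any existing daemon.json, so merge by hand on a real node.)
```
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "5" }
}
EOF
sudo systemctl restart docker
```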
Cluster Information:
CNI: Latest daemonset with image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.2.1
Region: eu-west-1
LOGS
Node AMI
File system
kubectl describe node
journalctl logs around the time
plugin logs
ipamd.log