I am using the EKS-optimised default AMI, as per the guide in the EKS documentation.
Are you using the latest one, or at least the one that was pushed in early January? It has a fix to tell kubelet to restart if it dies from SIGPIPE, which happens if journald gets killed by a lack of resources. If you run `cat /etc/systemd/system/kubelet.service | grep "RestartForceExitStatus"` on one of your worker nodes, you should see `RestartForceExitStatus=SIGPIPE` as output. If not, then you don't have the fix that should restart kubelet if journald dies.
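For reference, a quick way to check this on a worker node (a sketch; the exact journal wording varies by AMI and systemd version):

```bash
# Confirm the fix is present in the kubelet unit
grep RestartForceExitStatus /etc/systemd/system/kubelet.service
# Expected output: RestartForceExitStatus=SIGPIPE

# See whether kubelet has actually died and been restarted
systemctl status kubelet
journalctl -u kubelet --no-pager | grep -iE 'sigpipe|started' | tail -n 20
```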
@jaredeis, on behalf of @rakeshpatri: we tested it and I can see `RestartForceExitStatus=SIGPIPE` as output.
What more can we try?
I guess the only other thing I could suggest is opening a case with AWS to see if they can help determine why kubelet is dying on your nodes.
I have been seeing the same issue. We are on t2.medium and one node goes "NotReady" sporadically. The only solution I had was to detach and spawn another node; restarting does not work, sigh. Any interim solution for this? @agcooke / others, is your fork available somewhere? Clearly AWS has no time for this one! 😠
Most of the important changes I made are now merged into the latest AMI.
I would make sure that you are using EBS-optimized EC2 instances, and upgrade your Docker version by changing the version in the Packer file.
My fork is here: https://github.com/agcooke/amazon-eks-ami. I have moved on to another project, so I have not kept it up to date with the AWS image.
> I have been seeing the same issue. We are on t2.medium and one node goes "NotReady" sporadically. The only solution I had was to detach and spawn another node; restarting does not work, sigh. Any interim solution for this? @agcooke / others, is your fork available somewhere? Clearly AWS has no time for this one! 😠
We have set up Cluster Autoscaler and have set resource limits on all the deployments in Kubernetes. Also use an HPA for any deployment that consumes more resources. Since applying these changes we have not faced this issue anymore.
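For anyone wanting to try the same mitigation quickly, a rough sketch with kubectl (the deployment name and the numbers are placeholders, not what was used above):

```bash
# Add requests/limits to an existing deployment
kubectl set resources deployment/my-app \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi

# Add a CPU-based HorizontalPodAutoscaler for the busiest deployment
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=70
```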
@agcooke were the changes released? We are currently running `amazon-eks-node-1.11-v20190220` and unfortunately the issue remains.
@tckb I've seen it suggested that some of these changes will be in the 1.11.8 AMIs, which haven't yet been released but should be quite soon.
Our EKS cluster (1.11) with AMI ami-0f54a2f7d2e9c88b3 is facing the same issue randomly, and it kills my production services many times per day.
I was wondering whether upgrading the EKS cluster to 1.12 and using the latest AMI ami-0923e4b35a30a5f53 (following these steps: https://docs.aws.amazon.com/eks/latest/userguide/update-stack.html) could solve this problem.
Same issue on Server Version v1.12.6-eks-d69f1b and AMI ami-0abcb9f9190e867ab.
Same issue here, running EKS 1.12 and the latest AMIs on us-east-1.
It seems to be caused by an "Out of Memory" error on the kubelet host.
After adding the BootstrapArguments below to the CloudFormation template, the NotReady state is no longer happening.
Here are my BootstrapArguments:
--kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%"
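For context, in the standard EKS worker-node CloudFormation template the BootstrapArguments parameter is appended to the call to the AMI's bootstrap script, so on the node this ends up roughly as (cluster name is a placeholder):

```bash
# Roughly what the worker node's user data runs with the arguments above
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%"
```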
@benjamin658 / others, can you confirm this? I did not see any such errors in the logs.
I'm not 100 percent sure, but after I added the BootstrapArguments, our cluster has been working well.
Having the same issue.
EKS v1.12.6-eks-d69f1b
AMI ami-0abcb9f9190e867ab
@dijeesh did you try the suggestions from @benjamin658?
I'm experiencing the same issues. The problems started when I installed gitlab-runner using Helm and spawned ~20 jobs in an hour or so.
Nodes running v1.12.7 on AMI ami-0d741ed58ca5b342e.
I have weavescope installed in my cluster, and when looking at Hosts/Resources I see many containers named svc-0 (which are coming from GitLab). They are Docker containers that should have been deleted (and somehow they are, because when I search for them on the nodes using the Docker CLI they are gone; Kubernetes provides no further information either). That might be a weavescope bug, but if not, this might be a hint about the node NotReady issues.
Edit: I ran into a CNI issue as well (network addresses per host exhausted, so the pod limit was reached, aka "insufficient pods"); see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI.
In my particular case I was using t3.small instances (3 interfaces × 4 IPs each = 12 addresses, of which 11 are assignable). This might also be a cause for a node's status changing to NotReady.
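A quick way to compare what the node advertises with the ENI-derived limit (a sketch; the eni-max-pods.txt path is the one shipped in the EKS-optimized AMI, so treat it as an assumption for custom images):

```bash
# Pod capacity each node advertises to the scheduler
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods

# On the worker node: the per-instance-type pod limit baked into the EKS AMI
grep t3.small /etc/eks/eni-max-pods.txt
```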
I thought reserving resources for the kubelet was default/built-in behavior of current Kubernetes, but it sounds like it is optional and EKS doesn't do it 😢
Reserved resources for the kubelet are extremely important when you run overcommitted workloads (collections of spiky workloads), i.e. any time resource Limits >= Requests or you don't specify resource limits at all. Under node resource exhaustion you want some workloads to be rescheduled, not entire nodes to go down.
If you are using small nodes, failures like this will be more common. Plus you have the low EKS pod limit caused by ENI limitations. I'd suggest reserving some system resources on each node and using fewer, larger nodes.
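To see whether any reservation is actually in effect on a node, comparing Capacity with Allocatable shows how much is being held back (node name is a placeholder):

```bash
# Allocatable = Capacity minus kube-reserved, system-reserved and eviction thresholds
kubectl describe node <node-name> | grep -A 6 -E 'Capacity:|Allocatable:'
```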
This still happens on EKS 1.13. It started happening when the cluster was running under some really high load.
Happening to me as well. Looking at `kubectl get node <name> -o=yaml` I see taints:

```yaml
spec:
  providerID: aws:///us-east-1a/i-07b8613b0ed988d73
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: 2019-07-30T07:11:15Z
```
I think this might be related? https://github.com/weaveworks/eksctl/issues/795
We are seeing similar behaviour, which appears to be almost random or possibly coincides with a deployment. A node or two will suddenly appear to be NotReady, while resource graphs indicate utilisation is hardly over 50%, so OOM shouldn't be an issue.
As mentioned by @AmazingTurtle, we are also on 4-5 t3.small nodes with around 50 pods, so we may be seeing the effects of exhausted network addresses despite not seeing those logs.
In line with @montanaflynn the node has the following taints suddenly applied:
```
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                   Message
  ----             ------    -----------------                 ------------------                ------                   -------
  MemoryPressure   Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  DiskPressure     Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  PIDPressure      Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  Ready            Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  OutOfDisk        Unknown   Fri, 29 Nov 2019 14:03:02 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusNeverUpdated   Kubelet never posted node status.
```
Kubernetes version: 1.13
Platform version: eks.6
I'm going to try increasing node size and adding some resource limits to deployments that may not have them correctly configured.
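One way to spot workloads without requests/limits is the per-pod table in `kubectl describe node` (a sketch; node name is a placeholder):

```bash
# The "Non-terminated Pods" table lists CPU/memory requests and limits per pod;
# entries showing 0 (0%) have nothing configured
kubectl describe node <node-name> | grep -A 20 'Non-terminated Pods'
```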
I'm getting this on a t3.small node:

```yaml
spec:
  providerID: aws:///eu-west-1b/i-<redacted>
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2020-05-20T11:36:58Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2020-05-20T11:37:03Z"
```
What is adding these taints and will they ever get removed?
Seeing this on Amazon EKS 1.26.
What was the resolution to this issue? It still persists on EKS v1.23.
We are also facing this issue on a daily basis. Any resolution for this?
@dimittal this can happen for many reasons, please open a new issue with details of your environment and symptoms.
We are running EKS in Ireland and our nodes are going unhealthy regularly.
It is not possible to SSH to the host, and pods are not reachable. We have experienced this with t2.xlarge, t2.small and t3.medium instances.
We could SSH to another node in the cluster and ping the NotReady node, but we are not able to SSH to it either.
Graphs show that memory usage goes high at about the same time as the journalctl logs below. The EBS IO also goes high. The exact time is hard to pinpoint; I added logs with interesting 'failures' around the time that we think the node disappeared.
We are using the cluster for running tests, so pods are getting created and destroyed often.
We have not done anything described in https://github.com/awslabs/amazon-eks-ami/issues/51 for log rotation.
Cluster information:
CNI: latest daemonset with image 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.2.1
Region: eu-west-1
LOGS
Node AMI
File system
kubectl describe node
journalctl logs around the time
plugin logs
ipamd.log
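For anyone collecting the same information from an affected node, roughly the commands involved (a sketch; the CNI log paths are the VPC CNI defaults and may differ between plugin versions):

```bash
# Node AMI and file system state
cat /etc/os-release
df -h

# Node status as seen by the API server
kubectl describe node <node-name>

# journald/kubelet activity around the incident window (timestamps are placeholders)
journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM" --no-pager

# VPC CNI plugin and ipamd logs
sudo tail -n 200 /var/log/aws-routed-eni/plugin.log
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log
```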