I just realised that the TTL for a node to bootstrap is set to 15 minutes: https://github.com/kubernetes-sigs/karpenter/blob/46d3d646ea3784a885336b9c40fd22f406601441/pkg/controllers/nodeclaim/lifecycle/liveness.go#L40
This becomes interesting if we were not using Cilium: would EKS attempt to schedule pods to this non-functioning node?
You're right: if the kubelet was reporting the node as healthy, then even if Karpenter didn't consider it ready, the kube-scheduler would, and it would begin to schedule pods to it. Karpenter could do periodic health checks on nodes, though you could run into the issue of non-disruptible pods scheduling to the unhealthy nodes. If they are in fact able to run in some cases, I don't think Karpenter could safely disrupt the node. Thinking aloud, I'm wondering if it could make sense for Karpenter to apply a startup taint to nodes that it doesn't remove until Karpenter thinks the nodes are ready, with one of those conditions being that the node is in a running state.
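Conceptually (the taint key below is made up purely for illustration, not an actual Karpenter taint), that would behave like manually adding a NoSchedule taint at startup and removing it once the node is considered ready:

# hypothetical taint key, for illustration only
kubectl taint nodes <node-name> example.com/karpenter-not-initialized=:NoSchedule
# once the node is deemed ready (e.g. the instance is Running), the taint would be lifted
kubectl taint nodes <node-name> example.com/karpenter-not-initialized-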
I just realised that the TTL for a node to bootstrap is set to 15 minutes
The TTL is only set to 15 minutes for nodes that never actually join the cluster; however, it sounds like you are seeing that nodes do join the cluster and just that the node never goes into a ready state due to other hardware issues.
It seems like there is a lot of overlap here with https://github.com/kubernetes-sigs/karpenter/issues/750 which we are tracking more directly. I'm going to close this issue in favor of that one. I'd encourage you to go check out that issue, +1 it, and see if there is any additional content you think would be relevant to the discussion there as we're thinking about how to solve this problem.
We're currently prioritizing that issue in the v1.x backlog, so we see it as a high priority but don't plan on hitting it until a bit after v1. We haven't heard a lot of users hitting consistent issues with EC2 startup; I'd be curious to hear why this is happening so frequently and if this is something that we can push EC2 to solve through support since I wouldn't expect that you would be experiencing so much failure.
The TTL is only set to 15 minutes for nodes that never actually join the cluster; however, it sounds like you are seeing that nodes do join the cluster and just that the node never goes into a ready state due to other hardware issues.
That's correctly understood. The EC2 instance does indeed start, the kubelet joins the cluster and reports healthy, and Karpenter does in fact consider it as working. But the instance itself never leaves the EC2 state Pending.
When this occurred, we did not have a remote shell activated to see at what level the instance was actually experiencing hardware issues. That information is therefore not available, so if we see this again I'll re-open the issue if there is anything relevant to share.
It seems like there is a lot of overlap here with kubernetes-sigs/karpenter#750 which we are tracking more directly. I'm going to close this issue in favor of that one. I'd encourage you to go check out that issue, +1 it, and see if there is any additional content you think would be relevant to the discussion there as we're thinking about how to solve this problem.
I'm actually not sure if #750 is covering the use-case in my reported issue, as it seems that the use-cases there are for nodes that do not report Ready. My nodes did indeed report Ready, but the node had a startup taint that was not removed, because the EC2 instance had hardware issues that made it unable to attach another ENI.
We're currently prioritizing that issue in the v1.x backlog, so we see it as a high priority but don't plan on hitting it until a bit after v1. We haven't heard a lot of users hitting consistent issues with EC2 startup; I'd be curious to hear why this is happening so frequently and if this is something that we can push EC2 to solve through support since I wouldn't expect that you would be experiencing so much failure.
Well, hardware issues occur, so I don't think support can provide any details. The company is an AWS Enterprise customer, and the TAM has been informed about this issue.
@jonathan-innis We had another occurrence of a hardware issue on an EC2 instance.
We managed to establish an SSM session towards the instance. So from a networking perspective, this node seems to be working. The kubelet init script runs and all seems fine.
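(For reference, the session was opened with Session Manager; the instance ID below is a placeholder:)

aws ssm start-session --target i-0123456789abcdef0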
Noticeable errors I've found:
5740 log.go:194] http: TLS handshake error from 10.209.136.224:60536: no serving certificate available for the kubelet
That is the IP for the EKS control plane. It seems the kubelet is unable to obtain the certificate?
That's strange, why would the kubelet report as working when it is ... not?
[root@ip-10-209-138-36 log]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubelet-args.conf, 30-kubelet-extra-args.conf
Active: active (running) since Tue 2024-02-27 09:26:29 UTC; 3h 9min ago
Docs: https://github.com/kubernetes/kubernetes
Process: 5730 ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5 (code=exited, status=0/SUCCESS)
Main PID: 5740 (kubelet)
Tasks: 24
Memory: 101.6M
CGroup: /runtime.slice/kubelet.service
└─5740 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-c...
Feb 27 12:35:39 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:39.313837 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33060: no serving certificate available for the kubelet
Feb 27 12:35:40 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:40.358469 5740 log.go:194] http: TLS handshake error from 10.254.53.81:49352: no serving certificate available for the kubelet
Feb 27 12:35:40 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:40.367121 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33070: no serving certificate available for the kubelet
Feb 27 12:35:40 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:40.890979 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33080: no serving certificate available for the kubelet
Feb 27 12:35:41 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:41.293746 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33092: no serving certificate available for the kubelet
Feb 27 12:35:42 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:42.442106 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33100: no serving certificate available for the kubelet
Feb 27 12:35:43 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:43.144907 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33102: no serving certificate available for the kubelet
Feb 27 12:35:44 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:44.442592 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33112: no serving certificate available for the kubelet
Feb 27 12:35:45 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:45.135552 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33126: no serving certificate available for the kubelet
Feb 27 12:35:46 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:46.047502 5740 log.go:194] http: TLS handshake error from 10.209.136.224:33134: no serving certificate available for the kubelet
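For context, the stuck serving certificate shows up as a pending CSR, which plain kubectl is enough to spot (csr-f69z5 below is the request I then approved):

kubectl get csr
# the kubernetes.io/kubelet-serving CSR for this node is listed with condition Pending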
I manually approved the certificate, and the kubelet proceeds:
% kubectl certificate approve csr-f69z5
certificatesigningrequest.certificates.k8s.io/csr-f69z5 approved
[...]
Feb 27 12:37:44 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:44.447377 5740 log.go:194] http: TLS handshake error from 10.209.136.224:55130: no serving certificate available for the kubelet
Feb 27 12:37:45 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:45.331070 5740 log.go:194] http: TLS handshake error from 10.209.136.224:55146: no serving certificate available for the kubelet
Feb 27 12:37:45 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:45.510453 5740 csr.go:261] certificate signing request csr-f69z5 is approved, waiting to be issued
Feb 27 12:37:45 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:45.523411 5740 csr.go:257] certificate signing request csr-f69z5 is issued
Feb 27 12:37:46 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:46.524676 5740 certificate_manager.go:356] kubernetes.io/kubelet-serving: Certificate expiration is 2025-02-26 12:33:00 +0000 UTC, rotation deadline is 2024-12-20 07:46:46.160553788 +0000 UTC
Feb 27 12:37:46 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:46.524707 5740 certificate_manager.go:356] kubernetes.io/kubelet-serving: Waiting 7123h8m59.635850965s for next certificate rotation
Feb 27 12:37:47 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:47.524783 5740 certificate_manager.go:356] kubernetes.io/kubelet-serving: Certificate expiration is 2025-02-26 12:33:00 +0000 UTC, rotation deadline is 2024-11-13 06:20:45.61150787 +0000 UTC
Feb 27 12:37:47 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:47.524816 5740 certificate_manager.go:356] kubernetes.io/kubelet-serving: Waiting 6233h42m58.086694399s for next certificate rotation
Feb 27 12:37:53 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:53.056522 5740 scope.go:117] "RemoveContainer" containerID="b812882bb855e0051fe64eb5ee63518ed02c147e865b82b5f94b92a5a71f9664"
I don't get why this happens randomly. I'm assuming the hardware error/issue is real, but the fact that the kubelet reports OK when it is clearly not seems like an EKS issue.
Karpenter is still unaware that this node is not working at all. The instance state is still Pending, and the API call ec2:AttachNetworkInterface continues to fail since the state is not Running|Stopped.
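(The state can be confirmed straight from the EC2 API; the instance ID below is a placeholder:)

aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].[State.Name, BlockDeviceMappings[].Ebs.Status, NetworkInterfaces[].Attachment.Status]' \
  --output json
# still reports "pending" / "attaching" long after launch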
I've created an AWS Enterprise support case on the matter: 170841858601906
hi @toredash, any news here?
hi @toredash, any news here?
Yes, AWS Support said this is working correctly as is, and that this issue should be filed against https://github.com/kubernetes/cloud-provider-aws if I believe this is an issue.
@jonathan-innis Is there a chance to revisit this issue?
I would argue that Karpenter should only consider a node fully joined if the kubelet is reporting Ready and the EC2 instance is in the Running state. As it stands now, Karpenter is not aware that an EC2 instance could have underlying issues, which would be identified by the instance not transitioning from the pending to the running state.
I'm not that familiar with Golang, so I'm not sure where the logic for this should be placed. Could this be an enhancement of the init process?
// Reconcile checks for initialization based on if:
// a) its current status is set to Ready
// b) all the startup taints have been removed from the node
// c) all extended resources have been registered
// This method handles both nil nodepools and nodes without extended resources gracefully.
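By hand, the two signals such a check would have to combine look roughly like this (the private-dns-name filter is just one way to map the node back to its instance):

# kubelet's view: the Node condition reports Ready
kubectl get node ip-10-209-138-36.eu-north-1.compute.internal \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# EC2's view: the instance never left pending
aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-10-209-138-36.eu-north-1.compute.internal" \
  --query 'Reservations[].Instances[].State.Name' --output text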
Description
Observed Behavior: High-level: EC2 instances in the Pending state are not removed by Karpenter.
We are currently experiencing a higher-than-normal number of EC2 instances that have hardware issues and are not functional. The instances stay in the Pending state forever after they have been provisioned by Karpenter. Since the state of the EC2 instance never transitions out of Pending, we assumed that Karpenter would after a while mark the instance as unhealthy and replace it.
Some background information:
When describing the instance, status fields are either pending or attaching. AWS support confirmed that the physical server had issues. Note the State.Name, BlockDeviceMappings[].EBS.Status, and NetworkInterfaces[].Attachment.Status fields from aws ec2 describe-instances (some data removed):

The nodeclaim:
Relevant logs for nodeclaim standard-instance-store-x6wxs:
The EC2 node in question in kubernetes:
Note that we are using Cilium as the CNI. In normal operation Cilium removes the taint node.cilium.io/agent-not-ready from the node once the cilium-agent is running on it. The Cilium operator attempts to attach an additional ENI to the host via ec2:AttachNetworkInterface. AWS audit log entry below, notice the errorMessage:

The strange thing is that the Pending instance seems to be working, kind of. Pods that use hostNetwork: true are able to run on this instance, and they seem to work. The kubelet is reporting that it is ready. Fetching logs from a pod running on the node fails, though:
Error from server: Get "https://10.209.146.79:10250/containerLogs/kube-system/cilium-operator-5695bfbb6b-gm9ch/cilium-operator": remote error: tls: internal error
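(Reconstructed from the path in that error, the failing call was just an ordinary log fetch:)

kubectl logs -n kube-system cilium-operator-5695bfbb6b-gm9ch -c cilium-operator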
Expected Behavior: I'm not really sure, to be honest. The NodeClaim is stuck at Ready: false because Cilium is not removing the taints, since the operator is not able to attach an ENI to the instance. As the EC2 API reports the instance as Pending, I would expect Karpenter to mark the node as failed/not working and remove it.
So what I think should happen is that Karpenter marks EC2 instances that stay in the Pending state for more than 15 minutes as not ready and decommissions them.
Reproduction Steps (Please include YAML):
Versions:
v0.34.0
Kubernetes Version (kubectl version): Server Version: v1.27.9-eks-5e0fdde