aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter does not terminate instances in Pending state #5706

Closed · toredash closed this issue 8 months ago

toredash commented 8 months ago

Description

Observed Behavior: At a high level, EC2 instances stuck in the Pending state are not removed by Karpenter.

We are currently experiencing a higher-than-normal number of EC2 instances that have hardware issues and are not functional. These instances remain in the Pending state indefinitely after Karpenter provisions them. Since the state of the EC2 instance never transitions out of Pending, we assumed that Karpenter would eventually mark the instance as unhealthy and replace it.

Some background information:

When describing the instance, the status fields are either pending or attaching. AWS Support confirmed that the physical server had issues. Note the State.Name, BlockDeviceMappings[].Ebs.Status, and NetworkInterfaces[].Attachment.Status fields from aws ec2 describe-instances (some data removed):

"AmiLaunchIndex": 0,
"ImageId": "ami-0daf4f79825bf900f",
"InstanceId": "i-078e295d1e5549ea3",
"InstanceType": "i3.4xlarge",
"LaunchTime": "2024-02-21T05:58:45+00:00",
"Monitoring": {
    "State": "disabled"
},
"Placement": {
    "AvailabilityZone": "eu-north-1c",
    "GroupName": "",
    "Tenancy": "default"
},
"State": {
    "Code": 0,
    "Name": "pending"
},
"StateTransitionReason": "",
"BlockDeviceMappings": [
    {
        "DeviceName": "/dev/xvda",
        "Ebs": {
            "AttachTime": "2024-02-21T05:58:46+00:00",
            "DeleteOnTermination": true,
            "Status": "attaching",
            "VolumeId": "vol-01c8e9f683dfa7b06"
        }
    }
],
"ClientToken": "fleet-b9a41f87-b59d-4b3e-8612-0ea00715ca68-0",
"EbsOptimized": false,
"EnaSupport": true,
"Hypervisor": "xen",
"InstanceLifecycle": "spot",
"NetworkInterfaces": [
    {
        "Attachment": {
            "AttachTime": "2024-02-21T05:58:45+00:00",
            "AttachmentId": "eni-attach-0cdbb641145ccf6bc",
            "DeleteOnTermination": true,
            "DeviceIndex": 0,
            "Status": "attaching",
            "NetworkCardIndex": 0
        }
    }
],

"SourceDestCheck": true,
"SpotInstanceRequestId": "sir-9xzpngzn",

The nodeclaim:

Name:         standard-instance-store-x6wxs
Namespace:    
Labels:       karpenter.k8s.aws/instance-category=i
              karpenter.k8s.aws/instance-cpu=16
              karpenter.k8s.aws/instance-encryption-in-transit-supported=false
              karpenter.k8s.aws/instance-family=i3
              karpenter.k8s.aws/instance-generation=3
              karpenter.k8s.aws/instance-hypervisor=xen
              karpenter.k8s.aws/instance-local-nvme=3800
              karpenter.k8s.aws/instance-memory=124928
              karpenter.k8s.aws/instance-network-bandwidth=5000
              karpenter.k8s.aws/instance-size=4xlarge
              karpenter.sh/capacity-type=spot
              karpenter.sh/nodepool=standard-instance-store
              kubernetes.io/arch=amd64
              kubernetes.io/os=linux
              node.kubernetes.io/instance-type=i3.4xlarge
              topology.kubernetes.io/region=eu-north-1
              topology.kubernetes.io/zone=eu-north-1c
Annotations:  karpenter.k8s.aws/ec2nodeclass-hash: 14690241518068856330
              karpenter.k8s.aws/tagged: true
              karpenter.sh/managed-by: X
              karpenter.sh/nodepool-hash: 9268174783651286961
API Version:  karpenter.sh/v1beta1
Kind:         NodeClaim
Metadata:
  Creation Timestamp:  2024-02-21T05:57:38Z
  Finalizers:
    karpenter.sh/termination
  Generate Name:  standard-instance-store-
  Generation:     1
  Owner References:
    API Version:           karpenter.sh/v1beta1
    Block Owner Deletion:  true
    Kind:                  NodePool
    Name:                  standard-instance-store
    UID:                   a2aa544f-3e9f-4e08-b15f-ecd17bd8e512
  Resource Version:        954875751
  UID:                     e8e4e85f-8366-4f82-9f52-e5de137ee79f
Spec:
  Kubelet:
    Cluster DNS:
      10.255.0.10
    System Reserved:
      Cpu:                  250m
      Ephemeral - Storage:  6Gi
      Memory:               200Mi
  Node Class Ref:
    Name:  standard-instance-store
  Requirements:
    Key:       karpenter.k8s.aws/instance-local-nvme
    Operator:  Gt
    Values:
      50
    Key:       karpenter.sh/nodepool
    Operator:  In
    Values:
      standard-instance-store
    Key:       node.kubernetes.io/instance-type
    Operator:  In
    Values:
      c5d.12xlarge
      c5d.18xlarge
      c5d.24xlarge
      c5d.4xlarge
      c5d.9xlarge
      c5d.metal
      g4dn.12xlarge
      g4dn.16xlarge
      g4dn.4xlarge
      g4dn.8xlarge
      g4dn.metal
      g5.12xlarge
      g5.16xlarge
      g5.24xlarge
      g5.48xlarge
      g5.4xlarge
      g5.8xlarge
      i3.16xlarge
      i3.4xlarge
      i3.8xlarge
      i3.metal
      i3en.12xlarge
      i3en.24xlarge
      i3en.6xlarge
      i3en.metal
      i4i.12xlarge
      i4i.16xlarge
      i4i.24xlarge
      i4i.32xlarge
      i4i.4xlarge
      i4i.8xlarge
      i4i.metal
      m5d.12xlarge
      m5d.16xlarge
      m5d.24xlarge
      m5d.4xlarge
      m5d.8xlarge
      m5d.metal
      m6idn.12xlarge
      m6idn.16xlarge
      m6idn.24xlarge
      m6idn.32xlarge
      m6idn.4xlarge
      m6idn.8xlarge
      m6idn.metal
      r5d.12xlarge
      r5d.16xlarge
      r5d.24xlarge
      r5d.4xlarge
      r5d.8xlarge
      r5d.metal
      r5dn.12xlarge
      r5dn.16xlarge
      r5dn.24xlarge
      r5dn.4xlarge
      r5dn.8xlarge
      r5dn.metal
      r6idn.12xlarge
      r6idn.16xlarge
      r6idn.24xlarge
      r6idn.32xlarge
      r6idn.4xlarge
      r6idn.8xlarge
      r6idn.metal
      x2idn.16xlarge
      x2idn.24xlarge
      x2iedn.4xlarge
      x2iedn.8xlarge
    Key:       topology.kubernetes.io/zone
    Operator:  In
    Values:
      eu-north-1c
    Key:       karpenter.sh/capacity-type
    Operator:  In
    Values:
      on-demand
      spot
    Key:       karpenter.k8s.aws/instance-cpu
    Operator:  Gt
    Values:
      15
    Key:       kubernetes.io/arch
    Operator:  In
    Values:
      amd64
    Key:       kubernetes.io/os
    Operator:  In
    Values:
      linux
  Resources:
    Requests:
      Cpu:                  1200m
      Ephemeral - Storage:  1140Mi
      Memory:               2262733312
      Pods:                 14
  Startup Taints:
    Effect:  NoExecute
    Key:     node.cilium.io/agent-not-ready
    Value:   true
Status:
  Allocatable:
    Cpu:                  15640m
    Ephemeral - Storage:  3412483807232
    Memory:               112429Mi
    Pods:                 234
  Capacity:
    Cpu:                  16
    Ephemeral - Storage:  3800G
    Memory:               115558Mi
    Pods:                 234
  Conditions:
    Last Transition Time:  2024-02-21T05:59:36Z
    Message:               StartupTaint "node.cilium.io/agent-not-ready=true:NoExecute" still exists
    Reason:                StartupTaintsExist
    Status:                False
    Type:                  Initialized
    Last Transition Time:  2024-02-21T05:58:45Z
    Status:                True
    Type:                  Launched
    Last Transition Time:  2024-02-21T05:59:36Z
    Message:               StartupTaint "node.cilium.io/agent-not-ready=true:NoExecute" still exists
    Reason:                StartupTaintsExist
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-02-21T05:59:20Z
    Status:                True
    Type:                  Registered
  Image ID:                ami-0daf4f79825bf900f
  Node Name:               ip-10-209-146-79.eu-north-1.compute.internal
  Provider ID:             aws:///eu-north-1c/i-078e295d1e5549ea3
Events:                    <none>

Relevant logs for nodeclaim standard-instance-store-x6wxs:

{
    "level": "INFO",
    "time": "2024-02-21T05:57:38.903Z",
    "logger": "controller.disruption",
    "message": "created nodeclaim",
    "commit": "17d6c05",
    "nodepool": "standard-instance-store",
    "nodeclaim": "standard-instance-store-x6wxs",
    "requests": {
        "cpu": "1200m",
        "ephemeral-storage": "1140Mi",
        "memory": "2262733312",
        "pods": "14"
    },
    "instance-types": "c5d.12xlarge, c5d.18xlarge, c5d.24xlarge, c5d.4xlarge, c5d.9xlarge and 63 other(s)"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:39.964Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "ea2d9f2d-b7a9-4061-b65f-c1721321ee0c",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:40.978Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "b3374597-7339-4cb0-8970-30a58d1629d7",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:42.993Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "17e39ff8-17f3-48f1-986e-04c350a1d027",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:47.008Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "122c3c6d-1173-4dba-b79f-d74fc301864d",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:55.022Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "56b23a71-168e-4329-a4d0-b0dcab98c1d1",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:58:11.037Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "8d9503d3-1c09-444a-94ac-a12fb6a014a9",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "INFO",
    "time": "2024-02-21T05:58:45.711Z",
    "logger": "controller.nodeclaim.lifecycle",
    "message": "launched nodeclaim",
    "commit": "17d6c05",
    "nodeclaim": "standard-instance-store-x6wxs",
    "provider-id": "aws:///eu-north-1c/i-078e295d1e5549ea3",
    "instance-type": "i3.4xlarge",
    "zone": "eu-north-1c",
    "capacity-type": "spot",
    "allocatable": {
        "cpu": "15640m",
        "ephemeral-storage": "3412483807232",
        "memory": "112429Mi",
        "pods": "234"
    }
}
{
    "level": "ERROR",
    "time": "2024-02-21T06:07:44.710Z",
    "logger": "controller.disruption.queue",
    "message": "failed to disrupt nodes, command reached timeout after 10m5.783616799s; waiting for replacement initialization, nodeclaim standard-instance-store-x6wxs not initialized",
    "commit": "17d6c05",
    "command-id": "02d6d2c0-c823-410a-a3cb-1f6479bc2b3c",
    "nodes": "ip-10-209-146-248.eu-north-1.compute.internal"
}

The EC2 node in question in kubernetes:

Name:               ip-10-209-146-79.eu-north-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=i3.4xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eu-north-1
                    failure-domain.beta.kubernetes.io/zone=eu-north-1c
                    k8s.io/cloud-provider-aws=b7aae9ddc981b649535117c46866cfc4
                    karpenter.k8s.aws/instance-category=i
                    karpenter.k8s.aws/instance-cpu=16
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=false
                    karpenter.k8s.aws/instance-family=i3
                    karpenter.k8s.aws/instance-generation=3
                    karpenter.k8s.aws/instance-hypervisor=xen
                    karpenter.k8s.aws/instance-local-nvme=3800
                    karpenter.k8s.aws/instance-memory=124928
                    karpenter.k8s.aws/instance-network-bandwidth=5000
                    karpenter.k8s.aws/instance-size=4xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/nodepool=standard-instance-store
                    karpenter.sh/registered=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-209-146-79.eu-north-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=i3.4xlarge
                    topology.kubernetes.io/region=eu-north-1
                    topology.kubernetes.io/zone=eu-north-1c
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.209.146.79
                    karpenter.k8s.aws/ec2nodeclass-hash: 14690241518068856330
                    karpenter.sh/managed-by: X
                    karpenter.sh/nodepool-hash: 9268174783651286961
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 21 Feb 2024 06:59:20 +0100
Taints:             node.cilium.io/agent-not-ready=true:NoExecute
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-209-146-79.eu-north-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 21 Feb 2024 08:48:35 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 21 Feb 2024 08:46:57 +0100   Wed, 21 Feb 2024 06:59:20 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 21 Feb 2024 08:46:57 +0100   Wed, 21 Feb 2024 06:59:20 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 21 Feb 2024 08:46:57 +0100   Wed, 21 Feb 2024 06:59:20 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 21 Feb 2024 08:46:57 +0100   Wed, 21 Feb 2024 06:59:36 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.209.146.79
  InternalDNS:  ip-10-209-146-79.eu-north-1.compute.internal
  Hostname:     ip-10-209-146-79.eu-north-1.compute.internal
Capacity:
  cpu:                16
  ephemeral-storage:  3708852832Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             125680016Ki
  pods:               234
Allocatable:
  cpu:                15640m
  ephemeral-storage:  3410562571544
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             122475920Ki
  pods:               234
System Info:
  Machine ID:                 147fb4aeea144e9b81c7d74e1385102f
  System UUID:                ec201972-0cf5-7dcf-239e-bb62c07f1bed
  Boot ID:                    92c15038-7ea5-49be-b6d3-65739f260338
  Kernel Version:             5.10.209-198.812.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.11
  Kubelet Version:            v1.27.9-eks-5e0fdde
  Kube-Proxy Version:         v1.27.9-eks-5e0fdde
ProviderID:                   aws:///eu-north-1c/i-078e295d1e5549ea3
Non-terminated Pods:          (3 in total)
  Namespace                   Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                              ------------  ----------  ---------------  -------------  ---
  kube-system                 cilium-jlwk9                      100m (0%)     0 (0%)      10Mi (0%)        0 (0%)         109m
  kube-system                 ebs-csi-node-xg9w9                30m (0%)      0 (0%)      120Mi (0%)       768Mi (0%)     109m
  secrets-store-csi-driver    secrets-store-csi-driver-4xc6s    70m (0%)      500m (3%)   140Mi (0%)       400Mi (0%)     109m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (1%)   500m (3%)
  memory             270Mi (0%)  1168Mi (0%)
  ephemeral-storage  420Mi (0%)  0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:              <none>

Note that we are using Cilium as the CNI. Under normal operation, Cilium removes the node.cilium.io/agent-not-ready taint from a node once the cilium-agent is running on it. The Cilium operator attempts to attach an additional ENI to the host via ec2:AttachNetworkInterface. AWS audit log entry below; notice the errorMessage:

{
    "errorCode": "Client.IncorrectState",
    "eventSource": [
      "ec2.amazonaws.com"
    ],
    "errorMessage.keyword": [
      "Instance 'i-078e295d1e5549ea3' is not 'running' or 'stopped'."
    ],
    "eventTime": [
      "2024-02-21T05:59:42.000Z"
    ],
    "errorCode.keyword": [
      "Client.IncorrectState"
    ],
    "requestParameters.instanceId": [
      "i-078e295d1e5549ea3"
    ],
    "errorMessage": [
      "Instance 'i-078e295d1e5549ea3' is not 'running' or 'stopped'."
    ]
}

The strange thing is that the Pending instance seems to be working, sort of. Pods that use hostNetwork: true are able to run on this instance, and they appear to work. The kubelet reports that it is ready. Fetching logs from a pod running on the node fails, though: Error from server: Get "https://10.209.146.79:10250/containerLogs/kube-system/cilium-operator-5695bfbb6b-gm9ch/cilium-operator": remote error: tls: internal error

Expected Behavior: I'm not really sure, to be honest. The NodeClaim is stuck at Ready: false because Cilium never removes the taint, since the operator is unable to attach an ENI to the instance. Given that the EC2 API reports the instance as Pending, I would expect Karpenter to mark the node as failed/not working and remove it.

So what I think should happen is that Karpenter marks EC2 instances that have been in the Pending state for more than 15 minutes as not ready and decommissions them.
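To make the proposal concrete, here is a minimal sketch (my own illustration using aws-sdk-go-v2, not Karpenter code) of a scan that flags instances stuck in the pending state for more than 15 minutes:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Only inspect instances that are still in the pending state.
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		Filters: []types.Filter{{
			Name:   aws.String("instance-state-name"),
			Values: []string{"pending"},
		}},
	})
	if err != nil {
		panic(err)
	}
	for _, r := range out.Reservations {
		for _, i := range r.Instances {
			// An instance that launched over 15 minutes ago and is still
			// pending is almost certainly stuck and should be replaced.
			if i.LaunchTime != nil && time.Since(*i.LaunchTime) > 15*time.Minute {
				fmt.Printf("stuck pending: %s (launched %s ago)\n",
					aws.ToString(i.InstanceId), time.Since(*i.LaunchTime).Round(time.Second))
			}
		}
	}
}

An actual fix would presumably live in the NodeClaim lifecycle controller rather than in a standalone scan like this.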

Reproduction Steps (Please include YAML):

Versions:

toredash commented 8 months ago

I just realised that the TTL for a node to bootstrap is set to 15 minutes: https://github.com/kubernetes-sigs/karpenter/blob/46d3d646ea3784a885336b9c40fd22f406601441/pkg/controllers/nodeclaim/lifecycle/liveness.go#L40
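For reference, the linked check boils down to a single package-level TTL, roughly (paraphrased from the linked file; see the URL above for the authoritative version):

// NodeClaims whose node has not registered within this window are deleted
// by the liveness controller so the launch can be retried.
const registrationTTL = 15 * time.Minute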

This becomes interesting if we did not use Cilium: would EKS attempt to schedule pods to this non-functioning node?

jmdeal commented 8 months ago

You're right: if the kubelet reports the node as healthy, the kube-scheduler will treat it as schedulable and begin placing pods on it, even if Karpenter doesn't consider it ready. Karpenter could do periodic health checks on nodes, though you could run into the issue of non-disruptible pods scheduling to the unhealthy nodes; if they are in fact able to run in some cases, I don't think Karpenter could safely disrupt the node. Thinking aloud, I'm wondering if it could make sense for Karpenter to apply a startup taint to nodes that it doesn't remove until Karpenter thinks the nodes are ready, with one of those conditions being that the instance is in a running state.
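A rough sketch of that last idea (my illustration only; the taint key and helper are hypothetical, not an existing Karpenter API):

package health

import (
	corev1 "k8s.io/api/core/v1"

	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// instanceUnverifiedTaint is a hypothetical startup taint that Karpenter
// would apply at registration and own until it considers the node ready.
var instanceUnverifiedTaint = corev1.Taint{
	Key:    "karpenter.sh/instance-unverified", // hypothetical key
	Value:  "true",
	Effect: corev1.TaintEffectNoSchedule,
}

// readyToUntaint gates taint removal on the cloud provider's view of the
// instance, not just on the kubelet posting Ready: a node whose instance
// is still pending keeps the taint and never receives regular pods.
func readyToUntaint(nodeReady bool, state types.InstanceStateName) bool {
	return nodeReady && state == types.InstanceStateNameRunning
}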

jonathan-innis commented 8 months ago

I just realised that the TTL for a node to bootstrap is set to 15minutes

The TTL is only set to 15 minutes for nodes that never actually join the cluster; however, it sounds like your nodes do join the cluster and simply never reach a ready state due to the hardware issues.

It seems like there is a lot of overlap here with https://github.com/kubernetes-sigs/karpenter/issues/750 which we are tracking more directly. I'm going to close this issue in favor of that one. I'd encourage you to go check out that issue, +1 it, and see if there is any additional content you think would be relevant to the discussion there as we're thinking about how to solve this problem.

We're currently prioritizing that issue in the v1.x backlog, so we see it as a high priority, but we don't plan on hitting it until a bit after v1. We haven't heard of a lot of users hitting consistent issues with EC2 startup; I'd be curious to hear why this is happening so frequently for you, and whether it's something we can push EC2 to solve through support, since I wouldn't expect you to be experiencing this much failure.

toredash commented 8 months ago

The TTL is only set to 15 minutes for nodes that never actually join the cluster; however, it sounds like your nodes do join the cluster and simply never reach a ready state due to the hardware issues.

That's correctly understood. The EC2 instance does indeed start, the kubelet joins the cluster and reports healthy, and Karpenter does in fact consider it working. But the instance itself never leaves the EC2 Pending state.

When this occurred, we did not have remote shell access enabled, so we could not see at what level the instance was actually experiencing hardware issues. That access is now in place, so if we see this again I'll re-open the issue if there is anything relevant to share.

It seems like there is a lot of overlap here with kubernetes-sigs/karpenter#750 which we are tracking more directly. I'm going to close this issue in favor of that one. I'd encourage you to go check out that issue, +1 it, and see if there is any additional content you think would be relevant to the discussion there as we're thinking about how to solve this problem.

I'm actually not sure that #750 covers the use case in my reported issue, as the use cases there seem to be for nodes that do not report Ready. My nodes did report Ready, but they had a startup taint that was never removed, because the EC2 instance had hardware issues that made it unable to attach another ENI.

We're currently prioritizing that issue in the v1.x backlog, so we see it as a high priority, but we don't plan on hitting it until a bit after v1. We haven't heard of a lot of users hitting consistent issues with EC2 startup; I'd be curious to hear why this is happening so frequently for you, and whether it's something we can push EC2 to solve through support, since I wouldn't expect you to be experiencing this much failure.

Well, hardware issues happen, so I don't think support can provide any details. The company is an AWS Enterprise Support customer, and our TAM has been informed about this issue.

toredash commented 8 months ago

@jonathan-innis We had another occurrence of a hardware issue on an EC2 instance.

We managed to establish an SSM session to the instance, so from a networking perspective this node seems to be working. The kubelet init script runs and everything seems fine.

Notable errors I've found:

5740 log.go:194] http: TLS handshake error from 10.209.136.224:60536: no serving certificate available for the kubelet

That is the IP of the EKS control plane. It seems the kubelet is unable to obtain its serving certificate?

That's strange: why would the kubelet report itself as working when it is ... not?

[root@ip-10-209-138-36 log]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubelet-args.conf, 30-kubelet-extra-args.conf
   Active: active (running) since Tue 2024-02-27 09:26:29 UTC; 3h 9min ago
     Docs: https://github.com/kubernetes/kubernetes
  Process: 5730 ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5 (code=exited, status=0/SUCCESS)
 Main PID: 5740 (kubelet)
    Tasks: 24
   Memory: 101.6M
   CGroup: /runtime.slice/kubelet.service
           └─5740 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-c...

Feb 27 12:35:39 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:39.313837    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33060: no serving certificate available for the kubelet
Feb 27 12:35:40 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:40.358469    5740 log.go:194] http: TLS handshake error from 10.254.53.81:49352: no serving certificate available for the kubelet
Feb 27 12:35:40 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:40.367121    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33070: no serving certificate available for the kubelet
Feb 27 12:35:40 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:40.890979    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33080: no serving certificate available for the kubelet
Feb 27 12:35:41 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:41.293746    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33092: no serving certificate available for the kubelet
Feb 27 12:35:42 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:42.442106    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33100: no serving certificate available for the kubelet
Feb 27 12:35:43 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:43.144907    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33102: no serving certificate available for the kubelet
Feb 27 12:35:44 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:44.442592    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33112: no serving certificate available for the kubelet
Feb 27 12:35:45 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:45.135552    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33126: no serving certificate available for the kubelet
Feb 27 12:35:46 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:35:46.047502    5740 log.go:194] http: TLS handshake error from 10.209.136.224:33134: no serving certificate available for the kubelet

I manually approved the certificate, and the kubelet proceeds:

% kubectl certificate approve csr-f69z5
certificatesigningrequest.certificates.k8s.io/csr-f69z5 approved
[...]
Feb 27 12:37:44 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:44.447377    5740 log.go:194] http: TLS handshake error from 10.209.136.224:55130: no serving certificate available for the kubelet
Feb 27 12:37:45 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:45.331070    5740 log.go:194] http: TLS handshake error from 10.209.136.224:55146: no serving certificate available for the kubelet
Feb 27 12:37:45 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:45.510453    5740 csr.go:261] certificate signing request csr-f69z5 is approved, waiting to be issued
Feb 27 12:37:45 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:45.523411    5740 csr.go:257] certificate signing request csr-f69z5 is issued
Feb 27 12:37:46 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:46.524676    5740 certificate_manager.go:356] kubernetes.io/kubelet-serving: Certificate expiration is 2025-02-26 12:33:00 +0000 UTC, rotation deadline is 2024-12-20 07:46:46.160553788 +0000 UTC
Feb 27 12:37:46 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:46.524707    5740 certificate_manager.go:356] kubernetes.io/kubelet-serving: Waiting 7123h8m59.635850965s for next certificate rotation
Feb 27 12:37:47 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:47.524783    5740 certificate_manager.go:356] kubernetes.io/kubelet-serving: Certificate expiration is 2025-02-26 12:33:00 +0000 UTC, rotation deadline is 2024-11-13 06:20:45.61150787 +0000 UTC
Feb 27 12:37:47 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:47.524816    5740 certificate_manager.go:356] kubernetes.io/kubelet-serving: Waiting 6233h42m58.086694399s for next certificate rotation
Feb 27 12:37:53 ip-10-209-138-36.eu-north-1.compute.internal kubelet[5740]: I0227 12:37:53.056522    5740 scope.go:117] "RemoveContainer" containerID="b812882bb855e0051fe64eb5ee63518ed02c147e865b82b5f94b92a5a71f9664"
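For anyone debugging the same symptom, a small client-go sketch (my own, not part of Karpenter or EKS) that lists kubelet-serving CSRs still waiting for approval:

package main

import (
	"context"
	"fmt"

	certv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	csrs, err := client.CertificatesV1().CertificateSigningRequests().
		List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, csr := range csrs.Items {
		// Only kubelet serving certificates are relevant here.
		if csr.Spec.SignerName != certv1.KubeletServingSignerName {
			continue
		}
		approved := false
		for _, cond := range csr.Status.Conditions {
			if cond.Type == certv1.CertificateApproved {
				approved = true
			}
		}
		if !approved {
			fmt.Printf("pending serving CSR %s from %s\n", csr.Name, csr.Spec.Username)
		}
	}
}

Approving the reported CSR with kubectl certificate approve, as shown above, is what unblocked the kubelet here.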

I don't get why this happens randomly. I'm assuming the hardware error is real, but the fact that the kubelet reports OK when it clearly is not seems like an EKS issue.

Karpenter is still unaware that this node is not working at all. The instance state is still Pending, and the ec2:AttachNetworkInterface call continues to fail since the state is neither running nor stopped.

toredash commented 8 months ago

I've created an AWS Enterprise Support case on the matter: 170841858601906

tony-engineering commented 7 months ago

Hi @toredash, any news here?

toredash commented 7 months ago

Hi @toredash, any news here?

Yes, AWS Support said this is working correctly as is, and that this issue should be filed against https://github.com/kubernetes/cloud-provider-aws if I believe it is a problem.

toredash commented 7 months ago

@jonathan-innis Is there a chance to revisit this issue?

I would argue that Karpenter should only consider a node fully joined if the kubelet is reporting Ready and the EC2 instance is in the running state. As it stands, Karpenter is not aware that an EC2 instance can have underlying issues, which would be identified by the instance not transitioning from the pending to the running state.

I'm not that familiar with Go, so I'm not sure where the logic for this should be placed. Could this be an enhancement of the init process?

// Reconcile checks for initialization based on if:
// a) its current status is set to Ready
// b) all the startup taints have been removed from the node
// c) all extended resources have been registered
// This method handles both nil nodepools and nodes without extended resources gracefully.
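A hypothetical shape for that enhancement (all helper names are my own illustration, not Karpenter's actual code): keep the existing conditions and additionally require the cloud provider to report the instance as running.

package lifecycle

import (
	corev1 "k8s.io/api/core/v1"

	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// nodeReady reports whether the node's Ready condition is True.
func nodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// startupTaintsRemain reports whether any configured startup taint
// (e.g. node.cilium.io/agent-not-ready) is still present on the node.
func startupTaintsRemain(node *corev1.Node, startupTaints []corev1.Taint) bool {
	for _, st := range startupTaints {
		for _, t := range node.Spec.Taints {
			if t.Key == st.Key && t.Effect == st.Effect {
				return true
			}
		}
	}
	return false
}

// initialized is the proposed gate: kubelet Ready, startup taints gone,
// and (the new part) the backing EC2 instance actually in the running state.
func initialized(node *corev1.Node, startupTaints []corev1.Taint, state types.InstanceStateName) bool {
	return nodeReady(node) &&
		!startupTaintsRemain(node, startupTaints) &&
		state == types.InstanceStateNameRunning
}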