aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Nodes are reported with status "Unknown" before being decommissioned #7412

Open AppliNH opened 5 days ago

AppliNH commented 5 days ago

Description

Observed Behavior:

Since upgrading to Karpenter 1.0.8, nodes being decommissioned are reported with an Unknown status for about 1 minute by the kube_node_status_condition metric.

This wasn't the case before.

The exact PromQL query we're using is:

count by (node) (kube_node_status_condition{status="unknown"} == 1) > 0
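For anyone trying to reproduce this, a narrower variant of the query may help separate the Ready condition from the other node conditions. This is only a sketch: it assumes the standard kube-state-metrics labels `condition` and `status` on kube_node_status_condition, and it is not the query our alerting actually uses.

```promql
# Nodes whose Ready condition is currently reported as "unknown".
# Assumes kube-state-metrics labels: node, condition, status.
count by (node) (
  kube_node_status_condition{condition="Ready", status="unknown"} == 1
) > 0
```

If this variant returns the same nodes as the original query, the Unknown status comes from the Ready condition specifically and not from conditions such as MemoryPressure or DiskPressure.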


We've observed the following in the kubernetes-event-exporter logs for the nodes affected by this behavior:

"reason": "InstanceTerminating",
"message": "Instance is terminating"

"reason": "NodeNotReady",
"message": "Node node-placeholder status is now: NodeNotReady",

"reason": "RemovingNode",
"message": "Node XXX event: Removing Node XXX from Controller"

Expected Behavior:

Nodes being decommissioned should not be reported with an Unknown status.

Reproduction Steps (Please include YAML):

karpenter and karpenter-crd are deployed through Terraform:

module "addons" {
  source            = "aws-ia/eks-blueprints-addons/aws"
  version           = "v1.16.3"
  ...
  enable_karpenter = true
  karpenter = {
    chart_version = "1.0.8"
    values        = [templatefile("${path.module}/karpenter-values.yaml", {})]
  }
  ...
}
resource "helm_release" "karpenter_crd" {
  name       = "karpenter-crd"
  namespace  = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  version    = "1.0.8"
  chart      = "karpenter-crd"
}

karpenter-values.yaml

tolerations:
- key: dedicated
  operator: Equal
  value: orchestration
  effect: NoSchedule
nodeSelector:
  purpose: orchestration
podLabels:
  part-of: karpenter
  team: sre
serviceMonitor:
  enabled: true
controller:
  resources:
    requests:
      cpu: 1
      memory: 3Gi
    limits:
      memory: 3Gi
settings:
  aws:
    defaultInstanceProfile: ...

Versions: