aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

v0.37.0 upgrade on Kubernetes 1.30 - controller pods failing probes #6580

Closed s-sy-y closed 1 month ago

s-sy-y commented 1 month ago

Description

Observed Behavior:

When installing the v0.37.0 chart with ArgoCD into the "karpenter" namespace, the controller pods log nothing apart from the single line shown below. This configuration worked with the new v1beta APIs before (on v0.32.2); I am currently deploying v0.37.0 from scratch on a new cluster.

Also, some values in the chart releases are mismatched, which is confusing. For example, the v0.37.0 chart has the container tag and digest pointing to v0.36.0, which according to the compatibility matrix is not supported on 1.30 at all. It is unclear whether this migration is even possible.

Logs (debug)

{"level":"DEBUG","time":"2024-07-24T11:48:58.885Z","logger":"controller","message":"discovered karpenter version","commit":"490ef94","version":"0.37.0"}

Events

Warning Unhealthy 5m36s (x5 over 9m36s) kubelet Liveness probe failed: Get "http://10.39.70.196:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Expected Behavior:

Probes not failing.

Reproduction Steps (Please include YAML):

Source

repository: "https://github.com/aws/karpenter"
target_revision: "v0.37.0"
path: "charts/karpenter"

Values

replicas: 2
additionalAnnotations:
  ad.datadoghq.com/service.checks: |
    {"karpenter": {"init_config": {},"instances": [{"openmetrics_endpoint": "<REDACTED>"}]}}
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: <REDACTED>
revisionHistoryLimit: 0
controller:
  image:
    repository: public.ecr.aws/karpenter/controller
    tag: v0.37.0
    digest: sha256:157f478f5db1fe999f5e2d27badcc742bf51cc470508b3cebe78224d0947674f
settings:
  clusterName: <REDACTED>
  clusterEndpoint: "https://kubernetes.default.svc"
  interruptionQueueName: <REDACTED>
  isolatedVPC: true
logLevel: debug

Versions:

s-sy-y commented 1 month ago

I noticed that, after a short delay, the controller also reported errors for STS (ec2 api connectivity check failed). I'm running it on Fargate, and I managed to get the controller pods healthy by setting dnsPolicy: Default in the chart values file.
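
For anyone hitting the same probe failures, the values change described above can be sketched roughly as follows. This assumes the chart exposes dnsPolicy as a top-level key in its values; check the values.yaml of the chart version you deploy to confirm the exact key.

```yaml
# Assumption: dnsPolicy is a top-level key in the karpenter chart values.
# "Default" makes the pod inherit name resolution from the node it runs on
# rather than going through the cluster DNS service.
dnsPolicy: Default
```

The rest of the values shown in the reproduction steps stay unchanged; only this one key is added.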

On the mismatch topic, I see there are some OCI references instead of using the chart from Git directly. Does this mean that the OCI approach will be mandatory and the chart in Git will stop getting updates at some point?

engedaam commented 1 month ago

(ec2 api connectivity check failed)

Can you share the service account in your cluster? This usually indicates a permission issue. Have you followed the FAQ steps? https://karpenter.sh/docs/troubleshooting/#failed-resolving-sts-credentials-with-io-timeout

s-sy-y commented 1 month ago

Setting the dnsPolicy to Default actually fixed this already.

engedaam commented 1 month ago

On the mismatch topic, I see there are some OCI references instead of using the chart from Git directly. Does this mean that the OCI approach will be mandatory and the chart in Git will stop getting updates at some point?

Here is an issue about dropping chart support: https://github.com/aws/karpenter-provider-aws/issues/5847
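
For readers who want to try the OCI chart instead of the Git path, an Argo CD Application source might look roughly like the sketch below. It is not taken from this thread. The registry path and chart name are assumptions based on the oci://public.ecr.aws/karpenter/karpenter reference in the Karpenter install docs, the version string without a "v" prefix is also an assumption, and depending on the Argo CD version the registry may need to be registered as an OCI-enabled Helm repository first.

```yaml
# Hypothetical Argo CD Application source using the OCI Helm chart.
# Assumptions: registry path, chart name, and version format (no "v" prefix).
source:
  repoURL: public.ecr.aws/karpenter
  chart: karpenter
  targetRevision: 0.37.0
```

With a chart source like this, the image tag and digest come from the packaged chart release rather than from whatever the Git tree contains at the tag.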