karpenter pods CrashLoopBackOff - Readiness probe failed read: connection reset by peer / Liveness probe failed connect: connection refused

iamsaurabhgupt commented 1 month ago

Description

Observed Behavior:

kubectl get pods -n kube-system

    NAME                           READY   STATUS    RESTARTS   AGE
    aws-node-dc9hb                 2/2     Running   0          109m
    aws-node-pzbww                 2/2     Running   0          109m
    coredns-789f8477df-8r5zd       1/1     Running   0          114m
    coredns-789f8477df-tc5pt       1/1     Running   0          114m
    eks-pod-identity-agent-gqwrz   1/1     Running   0          109m
    eks-pod-identity-agent-sbng9   1/1     Running   0          109m
    karpenter-df9d8f6dd-xbz9d      0/1     Running   0          118s
    karpenter-df9d8f6dd-znnjw      0/1     Pending   0          118s
    kube-proxy-l8bcp               1/1     Running   0          109m
    kube-proxy-mnw6n               1/1     Running   0          109m

kubectl describe pod karpenter-df9d8f6dd-xbz9d -n kube-system

    Conditions:
      Type                        Status
      PodReadyToStartContainers   True
      Initialized                 True
      Ready                       False
      ContainersReady             False
      PodScheduled                True
    Volumes:
      aws-iam-token:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  86400
      kube-api-access-n9sbj:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:       <nil>
        DownwardAPI:             true
    QoS Class:                   Guaranteed
    Node-Selectors:              kubernetes.io/os=linux
    Tolerations:                 CriticalAddonsOnly op=Exists
                                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
      Type     Reason     Age                  From               Message
      Normal   Scheduled  3m15s                default-scheduler  Successfully assigned kube-system/karpenter-df9d8f6dd-xbz9d to ip-10-110-164-199.ec2.internal
      Normal   Pulled     75s (x2 over 3m15s)  kubelet            Container image "public.ecr.aws/karpenter/controller:1.0.5@sha256:f2df98735b232b143d37f0c6819a6cae2be4740e3c8b38297bceb365cf3f668b" already present on machine
      Normal   Created    75s (x2 over 3m15s)  kubelet            Created container controller
      Normal   Killing    75s                  kubelet            Container controller failed liveness probe, will be restarted
      Warning  Unhealthy  75s                  kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": read tcp 10.xxx.1x4.1x9:33238->10.xxx.1x5.1x3:8081: read: connection reset by peer
      Warning  Unhealthy  75s (x2 over 75s)    kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": dial tcp 10.xxx.1x5.153:8081: connect: connection refused
      Normal   Started    74s (x2 over 3m14s)  kubelet            Started container controller
      Warning  Unhealthy  5s (x5 over 2m35s)   kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      Warning  Unhealthy  5s (x4 over 2m15s)   kubelet            Liveness probe failed: Get "http://10.xxx.1x5.153:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Expected Behavior: karpenter pod must get to Running stage

Reproduction Steps (Please include YAML): EKS cluster version 1.31, created with the eksctl config below. I followed both https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/ and https://karpenter.sh/docs/getting-started/migrating-from-cas/, but neither worked.

    eksctl create cluster -f - <<EOF
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: xxxxx
      region: us-east-1
      version: "1.31"
      tags:
        karpenter.sh/discovery: xxxxx

    privateCluster:
      enabled: true
      skipEndpointCreation: true

    iam:
      withOIDC: true
      podIdentityAssociations:
      - namespace: "kube-system"
        serviceAccountName: karpenter
        roleName: xxxx-karpenter
        permissionPolicyARNs:
        - arn:aws:iam::xxxxxxxx:policy/KarpenterControllerPolicy-xxxx

    iamIdentityMappings:
    - arn: "arn:aws:iam::xxxxxxx:role/KarpenterNodeRole-xxxx"
      username: system:node:{{EC2PrivateDNSName}}
      groups:
      - system:bootstrappers
      - system:nodes

    managedNodeGroups:
    - instanceType: m5d.large
      amiFamily: AmazonLinux2
      name: xxxxx-ng
      desiredCapacity: 2
      minSize: 1
      maxSize: 10
      privateNetworking: true

    addons:
    - name: eks-pod-identity-agent
    - name: coredns
    - name: vpc-cni
    - name: kube-proxy
    EOF

KARPENTER_VERSION = 1.0.5 (tried 1.0.6 as well, but it didn't work)

    helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
      --set "settings.clusterName=${CLUSTER_NAME}" \
      --set "settings.interruptionQueue=${CLUSTER_NAME}" \
      --set "settings.isolatedVPC=true" \
      --set controller.resources.requests.cpu=1 \
      --set controller.resources.requests.memory=1Gi \
      --set controller.resources.limits.cpu=1 \
      --set controller.resources.limits.memory=1Gi \
      --wait

Tried dnsPolicy=Default, but it didn't work.

kubectl logs karpenter-df9d8f6dd-xbz9d -n kube-system

    {"level":"DEBUG","time":"2024-10-21T00:04:42.255Z","logger":"controller","caller":"operator/operator.go:149","message":"discovered karpenter version","commit":"652e6aa","version":"1.0.5"}

kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp' -n 100

    No resources found

Tried DISABLE_WEBHOOK=true, but that didn't work either.

Versions:

- Chart Version: 1.0.5 and 1.0.6 (neither works)
- Kubernetes Version (kubectl version): 1.31

njtran commented 4 weeks ago

There should be more logs than "discovered karpenter version", can you post those?

pkit commented 3 weeks ago

@njtran there should be, but there aren't. It literally means that the liveness probe fails out of the blue for no apparent reason, and then a perfectly working pod is restarted. It looks like the problem from #6637 or #2186, but it's not clear why it still fails. In my case the trouble started when I upgraded from 0.36.2 to 0.37.5. And yes, I use Fargate.

iamsaurabhgupt commented 3 weeks ago

How we fixed it:

- Increased the liveness probe to 1200s to avoid premature restarts that happen before any logs are written (this should really be the default on start).
- Once we got actual logs, we saw that the real issue was an authorization error.
- Digging further, we found that a VPN/security group was rejecting the access. Once it was allowed, the problem was fixed.
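For anyone trying to get at those logs before changing probe settings, here is a minimal sketch that asks kubectl for the output of the previous (already restarted) container of each Karpenter pod. The namespace default and the `app.kubernetes.io/name=karpenter` label selector are assumptions about the install, and the function name `previous_karpenter_logs` is made up for illustration. If the container was killed before it wrote anything, this will still come back empty.

```shell
# Assumption: Karpenter runs in kube-system unless KARPENTER_NAMESPACE says otherwise.
NAMESPACE="${KARPENTER_NAMESPACE:-kube-system}"
# Assumption: the pods carry this label; adjust it to match your install.
SELECTOR="app.kubernetes.io/name=karpenter"

# Hypothetical helper: print logs of the last terminated container for each pod.
previous_karpenter_logs() {
  # "-o name" prints one "pod/<name>" entry per line.
  kubectl get pods -n "$NAMESPACE" -l "$SELECTOR" -o name |
  while read -r pod; do
    echo "=== $pod ==="
    # --previous reads the prior container instance; ignore pods that never restarted.
    kubectl logs -n "$NAMESPACE" "$pod" --previous --tail=200 || true
  done
}
```

Run `previous_karpenter_logs` right after a restart shows up in `kubectl get pods`.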

Recommendation to maintainers:

increase liveness/readiness initialDelaySeconds on helm chart

      livenessProbe:
        initialDelaySeconds: 600
        timeoutSeconds: 300
        httpGet:
          path: /healthz
          port: http
      readinessProbe:
        initialDelaySeconds: 540
        timeoutSeconds: 300
        httpGet:
          path: /readyz
          port: http

pkit commented 3 weeks ago

LOL. So need to re-deploy it manually just to find out what the error is? Nice! Thanks!

qspors commented 3 weeks ago

how we fixed it:

increase liveness/readiness initialDelaySeconds on helm chart

          livenessProbe:
            initialDelaySeconds: 600
            timeoutSeconds: 300
            httpGet:
              path: /healthz
              port: http
          readinessProbe:
            initialDelaySeconds: 540
            timeoutSeconds: 300
            httpGet:
              path: /readyz
              port: http

I have the same issue. Regarding these changes: I checked values.yaml for the latest release and don't see these settings there, so they are not templated as chart values.

Did you change them directly in the deployment, or did you modify the chart and install it? I deploy Karpenter through Terraform with a Fargate profile, so I can only change values that exist in values.yaml.

brandonphan commented 2 weeks ago


@qspors Ended up using the solution for debugging and edited the deployment directly since I couldn't find any configuration in the helm chart.
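A rough sketch of what editing the deployment directly could look like with `kubectl patch`, using the probe values quoted earlier in this thread. The deployment name `karpenter`, the namespace default, and the controller being container index 0 are assumptions, and `patch_karpenter_probes` is a made-up helper name. This is for debugging only; a later Helm upgrade may overwrite the change.

```shell
# Assumptions: release/deployment named "karpenter", installed in kube-system by default.
NAMESPACE="${KARPENTER_NAMESPACE:-kube-system}"
DEPLOYMENT="karpenter"

# JSON Patch document. "add" sets the field whether or not it already exists,
# but the livenessProbe/readinessProbe objects themselves must already be present.
# Assumption: the controller is the first container (index 0) in the pod spec.
PROBE_PATCH='[
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 600},
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 300},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 540},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 300}
]'

# Hypothetical helper: apply the patch; the deployment then rolls out new pods.
patch_karpenter_probes() {
  kubectl patch deployment "$DEPLOYMENT" -n "$NAMESPACE" --type=json -p "$PROBE_PATCH"
}
```

Call `patch_karpenter_probes`, collect the logs, then restore the original probe settings once the real error is known.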

pkit commented 1 week ago

Okay. In my case it was a simple OOM: the default fargate_profile of 0.25 CPU / 0.5Gi RAM stopped working for me starting with 0.37.5 (it worked fine for <=0.34). It was almost impossible to catch; I only saw the error by chance. The solution was to set RAM to 1Gi in the Helm chart values (Terraform with aws-ia/eks-blueprints-addons/aws):

  karpenter = {
    chart_version = "0.37.5"
    values = [
      <<-EOT
        controller:
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 1000m
              memory: 1Gi
      EOT
    ]
  }
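Since the OOM was only spotted by chance, here is a small sketch of one way to look for it on purpose: read the last termination reason that the kubelet recorded for the container. The helper names are invented for illustration, the namespace default is an assumption, and the check only looks at the first container in the pod. Whether the reason is always reported as `OOMKilled` on Fargate is not confirmed in this thread.

```shell
# Assumption: Karpenter runs in kube-system unless KARPENTER_NAMESPACE says otherwise.
NAMESPACE="${KARPENTER_NAMESPACE:-kube-system}"

# Hypothetical helper. $1: pod name.
# Prints why the first container last terminated (empty if it never terminated).
last_termination_reason() {
  kubectl get pod "$1" -n "$NAMESPACE" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}

# Hypothetical helper. $1: pod name.
# Succeeds only when the recorded reason is exactly OOMKilled.
was_oom_killed() {
  [ "$(last_termination_reason "$1")" = "OOMKilled" ]
}
```

For example, `was_oom_killed <pod-name> && echo "raise the memory limit"` after a restart.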

aws / karpenter-provider-aws

karpenter pods CrashLoopBackOff - Readiness probe failed read: connection reset by peer / Liveness probe failed connect: connection refused #7256
