Open iamsaurabhgupt opened 1 month ago
There should be more logs than "discovered karpenter version", can you post those?
@njtran there should be, but there aren't. It literally means that the liveness probe fails out of the blue for no apparent reason, and then a perfectly working pod is restarted. It looks like the problem from #6637 or #2186, but it's still not clear why it fails. In my case the trouble started when I upgraded from 0.36.2 to 0.37.5. And yes, I use Fargate.
how we fixed it:
Recommendation to maintainers:
livenessProbe:
  initialDelaySeconds: 600
  timeoutSeconds: 300
  httpGet:
    path: /healthz
    port: http
readinessProbe:
  initialDelaySeconds: 540
  timeoutSeconds: 300
  httpGet:
    path: /readyz
    port: http
LOL. So need to re-deploy it manually just to find out what the error is? Nice! Thanks!
how we fixed it:
- increase liveness/readiness initialDelaySeconds on helm chart
livenessProbe:
  initialDelaySeconds: 600
  timeoutSeconds: 300
  httpGet:
    path: /healthz
    port: http
readinessProbe:
  initialDelaySeconds: 540
  timeoutSeconds: 300
  httpGet:
    path: /readyz
    port: http
I have the same issue. Regarding these changes: I checked values.yaml for the latest release and don't see these settings there, so they are not templated in the chart values.
Did you change them directly in the Deployment, or did you add the changes to the chart and install it? I deploy Karpenter through Terraform with a Fargate profile and can only change values that exist in values.yaml.
@qspors Ended up using the solution for debugging and edited the deployment directly since I couldn't find any configuration in the helm chart.
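For anyone else who has to edit the Deployment directly, the same probe values can be applied as a strategic merge patch instead of a manual edit. The following is only a sketch: the Deployment name `karpenter` and the namespace `kube-system` are assumptions inferred from the pod names in this issue, the container name `controller` is taken from the events output, and `build_probe_patch` is a hypothetical helper, not something shipped with Karpenter or its chart.

```python
import json


def build_probe_patch(container="controller",
                      liveness_delay=600,
                      readiness_delay=540,
                      timeout=300):
    """Return a strategic merge patch that only overrides probe timings.

    Containers are matched by name, so the existing httpGet path and port
    of each probe are not part of the patch and stay as deployed.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,  # assumed container name
                            "livenessProbe": {
                                "initialDelaySeconds": liveness_delay,
                                "timeoutSeconds": timeout,
                            },
                            "readinessProbe": {
                                "initialDelaySeconds": readiness_delay,
                                "timeoutSeconds": timeout,
                            },
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the patch so it can be redirected into a file for kubectl.
    print(json.dumps(build_probe_patch(), indent=2))
```

Assuming the script is saved as `probe_patch.py`, the output could be redirected to `probe-patch.json` and applied with `kubectl -n kube-system patch deployment karpenter --patch-file probe-patch.json`. Changes made this way live outside the Helm release, so a later upgrade may revert them.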
Okay. In my case it was a simple OOM.
I.e. the default fargate_profile of 0.25cpu/0.5Gi RAM stopped working for me starting from 0.37.5 (it worked fine for <=0.34).
It was almost impossible to catch; I only saw the error by chance.
The solution was to set RAM to 1Gi in the helm chart values (terraform with aws-ia/eks-blueprints-addons/aws):
karpenter = {
  chart_version = "0.37.5"
  values = [
    <<-EOT
    controller:
      resources:
        limits:
          cpu: 1000m
          memory: 1Gi
        requests:
          cpu: 1000m
          memory: 1Gi
    EOT
  ]
}
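Since the OOM was hard to spot, here is a rough sketch of how one might check for it from the pod status instead of relying on luck. It assumes the kubelet reports the termination reason as `OOMKilled` in `containerStatuses` (whether that is always populated on Fargate is not confirmed in this thread). `oom_killed_containers` is a hypothetical helper, and the sample status below is made up for illustration.

```python
def oom_killed_containers(pod):
    """Return names of containers whose last or current termination was an OOM kill.

    `pod` is the parsed JSON of a single pod, for example the output of
    `kubectl get pod <name> -n <namespace> -o json` loaded with json.load().
    """
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # Look at the previous run first, then at the current state.
        for key in ("lastState", "state"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names


# Trimmed, made-up pod status used only to illustrate the shape of the data.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "controller",
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                "state": {"running": {}},
            }
        ]
    }
}

print(oom_killed_containers(sample_pod))  # prints ['controller']
```

An empty list does not rule out memory pressure; it only means no container in that pod currently reports `OOMKilled` as its termination reason.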
Description
Observed Behavior:
kubectl get pods -n kube-system
NAME                           READY   STATUS    RESTARTS   AGE
aws-node-dc9hb                 2/2     Running   0          109m
aws-node-pzbww                 2/2     Running   0          109m
coredns-789f8477df-8r5zd       1/1     Running   0          114m
coredns-789f8477df-tc5pt       1/1     Running   0          114m
eks-pod-identity-agent-gqwrz   1/1     Running   0          109m
eks-pod-identity-agent-sbng9   1/1     Running   0          109m
karpenter-df9d8f6dd-xbz9d      0/1     Running   0          118s
karpenter-df9d8f6dd-znnjw      0/1     Pending   0          118s
kube-proxy-l8bcp               1/1     Running   0          109m
kube-proxy-mnw6n               1/1     Running   0          109m
kubectl describe pod karpenter-df9d8f6dd-xbz9d -n kube-system
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  kube-api-access-n9sbj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
  Normal   Scheduled  3m15s                default-scheduler  Successfully assigned kube-system/karpenter-df9d8f6dd-xbz9d to ip-10-110-164-199.ec2.internal
  Normal   Pulled     75s (x2 over 3m15s)  kubelet            Container image "public.ecr.aws/karpenter/controller:1.0.5@sha256:f2df98735b232b143d37f0c6819a6cae2be4740e3c8b38297bceb365cf3f668b" already present on machine
  Normal   Created    75s (x2 over 3m15s)  kubelet            Created container controller
  Normal   Killing    75s                  kubelet            Container controller failed liveness probe, will be restarted
  Warning  Unhealthy  75s                  kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": read tcp 10.xxx.1x4.1x9:33238->10.xxx.1x5.1x3:8081: read: connection reset by peer
  Warning  Unhealthy  75s (x2 over 75s)    kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": dial tcp 10.xxx.1x5.153:8081: connect: connection refused
  Normal   Started    74s (x2 over 3m14s)  kubelet            Started container controller
  Warning  Unhealthy  5s (x5 over 2m35s)   kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  5s (x4 over 2m15s)   kubelet            Liveness probe failed: Get "http://10.xxx.1x5.153:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Expected Behavior: karpenter pod must get to Running stage
Reproduction Steps (Please include YAML): EKS cluster version 1.31, created with eksctl. I followed both https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/ and https://karpenter.sh/docs/getting-started/migrating-from-cas/, but neither worked.
eksctl create cluster -f - <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: xxxxx
  region: us-east-1
  version: "1.31"
  tags:
    karpenter.sh/discovery: xxxxx

privateCluster:
  enabled: true
  skipEndpointCreation: true

iam:
  withOIDC: true
  podIdentityAssociations:

iamIdentityMappings:

managedNodeGroups:

addons:
KARPENTER_VERSION = 1.0.5 (tried 1.0.6 as well but didn't work)

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set "settings.isolatedVPC=true" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
tried dnsPolicy=Default but didn't work
kubectl logs karpenter-df9d8f6dd-xbz9d -n kube-system
{"level":"DEBUG","time":"2024-10-21T00:04:42.255Z","logger":"controller","caller":"operator/operator.go:149","message":"discovered karpenter version","commit":"652e6aa","version":"1.0.5"}
kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp' -n 100
No resources found
tried DISABLE_WEBHOOK=true but didn't work as well
Versions:
Chart Version: 1.0.5 and 1.0.6 both don't work
Kubernetes Version (kubectl version): 1.31
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment