aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Webhooks issue, following guides #5133

Closed k-walsh-gmg closed 12 months ago

k-walsh-gmg commented 12 months ago

Description

Observed Behavior: Creating a node pool fails with webhook errors:

Error from server (InternalError): error when creating "../Buckeye(us-east-2)provisioner/default-nodepool.yaml": Internal error occurred: failed calling webhook "validation.webhook.karpenter.sh": failed to call webhook: Post "https://karpenter.karpenter.svc:8443/?timeout=10s": no endpoints available for service "karpenter"

Error from server (InternalError): error when creating "../Buckeye(us-east-2)provisioner/default-nodepool.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.karpenter.k8s.aws": failed to call webhook: Post "https://karpenter.karpenter.svc:8443/?timeout=10s": no endpoints available for service "karpenter"

Expected Behavior: webhooks to work

Reproduction Steps (Please include YAML): After following instructions here, it is impossible to deploy any node pools: https://karpenter.sh/preview/getting-started/migrating-from-cas/

Source: karpenter/templates/poddisruptionbudget.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: karpenter
  namespace: karpenter
  labels:
    helm.sh/chart: karpenter-v0.32.1
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "0.32.1"
    app.kubernetes.io/managed-by: Helm
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
      app.kubernetes.io/instance: karpenter

Source: karpenter/templates/serviceaccount.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: karpenter
  namespace: karpenter
  labels:
    helm.sh/chart: karpenter-v0.32.1
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "0.32.1"
    app.kubernetes.io/managed-by: Helm
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::{HIDDENACCOUNT}:role/KarpenterControllerRole-buckeye

Source: karpenter/templates/secret-webhook-cert.yaml

apiVersion: v1
kind: Secret
metadata:
  name: karpenter-cert
  namespace: karpenter
  labels:
    helm.sh/chart: karpenter-v0.32.1
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "0.32.1"
    app.kubernetes.io/managed-by: Helm
data: {} # Injected by karpenter-webhook


Source: karpenter/templates/configmap-logging.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: karpenter
  labels:
    helm.sh/chart: karpenter-v0.32.1
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "0.32.1"
    app.kubernetes.io/managed-by: Helm
data:
  # https://github.com/uber-go/zap/blob/aa3e73ec0896f8b066ddf668597a02f89628ee50/config.go
  zap-logger-config: |
    {
      "level": "debug",
      "development": false,
      "disableStacktrace": true,
      "disableCaller": true,
      "sampling": {
        "initial": 100,
        "thereafter": 100
      },
      "outputPaths": ["stdout"],
      "errorOutputPaths": ["stderr"],
      "encoding": "json",
      "encoderConfig": {
        "timeKey": "time",
        "levelKey": "level",
        "nameKey": "logger",
        "callerKey": "caller",
        "messageKey": "message",
        "stacktraceKey": "stacktrace",
        "levelEncoder": "capital",
        "timeEncoder": "iso8601"
      }
    }
  loglevel.controller: debug
  loglevel.webhook: error

Source: karpenter/templates/configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-global-settings
  namespace: karpenter
  labels:
    helm.sh/chart: karpenter-v0.32.1
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "0.32.1"
    app.kubernetes.io/managed-by: Helm
data:
  batchMaxDuration: "10s"
  batchIdleDuration: "1s"

Source: karpenter/templates/aggregate-clusterrole.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: karpenter-admin
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    helm.sh/chart: karpenter-v0.32.1
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "0.32.1"
    app.kubernetes.io/managed-by: Helm
rules:

Versions:

jonathan-innis commented 12 months ago

Are the Karpenter pods on your cluster up and running?

k-walsh-gmg commented 12 months ago

No, actually, and I'm not sure why at this point. They had been running earlier, but now they are in a crash loop.

jonathan-innis commented 12 months ago

> They had been running earlier but now they are in a crashloop

I'd suspect that's why the webhook calls are failing: with no pod running, the backend for the service is down, so there are no endpoints to serve the request.
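The "no endpoints available" text in the original error is the part that points at the pods rather than at the webhook configuration. As a rough sketch only (the function name `diagnose_webhook_error` and the hint strings are made up for illustration and are not part of Karpenter or kubectl), the relevant pieces of such a message can be pulled apart like this:

```python
import re

# Matches the shape of the API server message quoted in the issue description.
# Only that shape is handled; any other message yields None.
WEBHOOK_ERROR = re.compile(
    r'failed calling webhook "(?P<webhook>[^"]+)": '
    r'failed to call webhook: Post "(?P<url>[^"]+)": (?P<reason>.+)'
)


def diagnose_webhook_error(message):
    """Return the webhook name, target URL, reason and a hint, or None."""
    match = WEBHOOK_ERROR.search(message)
    if match is None:
        return None
    info = match.groupdict()
    if "no endpoints available" in info["reason"]:
        # The Service exists but no ready pod backs it.
        info["hint"] = "check whether the pods behind the Service are running"
    else:
        info["hint"] = "pods may be up; inspect the reason text directly"
    return info


# Message copied from the issue description.
sample = (
    'Internal error occurred: failed calling webhook '
    '"validation.webhook.karpenter.sh": failed to call webhook: '
    'Post "https://karpenter.karpenter.svc:8443/?timeout=10s": '
    'no endpoints available for service "karpenter"'
)
print(diagnose_webhook_error(sample)["hint"])
```

Keeping the raw reason in the result is deliberate: the hint is a guess, and the reason text is what the API server actually reported.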

k-walsh-gmg commented 12 months ago

Got the pod running and am seeing this:

{"level":"ERROR","time":"2023-11-21T23:42:17.823Z","logger":"webhook.ConfigMapWebhook","message":"Reconcile error","commit":"1072d3b","knative.dev/traceid":"92378627-1dc2-4135-9094-c492b94b8ece","knative.dev/key":"kube-system/karpenter-cert","duration":"71.170423ms","error":"failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.config.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}

jonathan-innis commented 12 months ago

AFAIK, that error is expected because there can be conflicts with reconciling the certificate on startup. Is there something else on the pod that is causing the CrashLoopBackOff?
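For context, "the object has been modified" is the message the Kubernetes API returns when an update carries a stale resourceVersion. The toy model below is an in-memory sketch of that optimistic-concurrency check and of a read-modify-write retry; it does not use the Kubernetes API or Karpenter's code, and `Store`, `update_with_retry` and the `before_write` hook are hypothetical names introduced only for illustration.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale version."""


class Store:
    """Toy stand-in for one API object guarded by a version counter."""

    def __init__(self, spec):
        self.spec = dict(spec)
        self.version = 1

    def get(self):
        return self.version, dict(self.spec)

    def update(self, seen_version, spec):
        # Reject writes that were computed from an older copy.
        if seen_version != self.version:
            raise Conflict("the object has been modified")
        self.spec = dict(spec)
        self.version += 1


def update_with_retry(store, mutate, attempts=5, before_write=None):
    """Re-read and re-apply `mutate` until the write lands; return tries used."""
    for attempt in range(1, attempts + 1):
        version, spec = store.get()
        mutate(spec)
        if before_write is not None:
            before_write(attempt)  # test hook: lets another writer sneak in
        try:
            store.update(version, spec)
            return attempt
        except Conflict:
            continue  # stale copy; loop re-reads the latest version
    raise Conflict("gave up after %d attempts" % attempts)


store = Store({"caBundle": ""})


def rival(attempt):
    # A second writer updates the object during the first attempt only.
    if attempt == 1:
        version, spec = store.get()
        spec["owner"] = "replica-b"
        store.update(version, spec)


tries = update_with_retry(
    store, lambda spec: spec.update(caBundle="abc"), before_write=rival
)
print(tries, store.get())
```

In this model the first write is rejected and the second succeeds, which is one plausible reading of why a conflict logged once at startup can be harmless as long as a later reconcile goes through.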

k-walsh-gmg commented 12 months ago

As stated previously, I was able to get Karpenter up and running in a good state (including the controller container); I got the above error after the pods were running. After scaling the karpenter deployment to 0 and back to 2, the above error no longer appears. Currently Karpenter is deployed and creates instances, but they never join the cluster (and it keeps attempting to boot up instances).

jonathan-innis commented 12 months ago

Sounds good. So it sounds like the webhook issue that you were seeing originally is resolved? Can I close this issue?

k-walsh-gmg commented 12 months ago

Sure!

rithikb24 commented 7 months ago

Hi @jonathan-innis, I am continuously seeing these reconcile errors after upgrading from 0.27 to 0.31.2.


2024-03-26T08:08:12.562Z    DEBUG   Successfully created the logger.
2024-03-26T08:08:12.563Z    DEBUG   Logging level set to: debug
{"level":"info","ts":1711440492.5673463,"logger":"fallback","caller":"injection/injection.go:63","msg":"Starting informers..."}
2024-03-26T08:08:12.668Z    DEBUG   controller  waiting for configmaps  {"commit": "dc3af1a"}
2024-03-26T08:08:13.495Z    DEBUG   controller  waiting for configmaps  {"commit": "dc3af1a"}
2024-03-26T08:08:14.003Z    DEBUG   controller  waiting for configmaps  {"commit": "dc3af1a"}
2024-03-26T08:08:14.511Z    DEBUG   controller  waiting for configmaps  {"commit": "dc3af1a"}
2024-03-26T08:08:15.012Z    DEBUG   controller  waiting for configmaps  {"commit": "dc3af1a"}
2024-03-26T08:08:15.650Z    DEBUG   controller.aws  discovered region   {"commit": "dc3af1a", "region": "ap-south-1"}
2024-03-26T08:08:15.830Z    DEBUG   controller.aws  discovered cluster endpoint {"commit": "dc3af1a", "cluster-endpoint": "https://1D1B4752D1D078AEFEF3A0CC32AD79FE.gr7.ap-south-1.eks.amazonaws.com"}
2024-03-26T08:08:15.834Z    DEBUG   controller.aws  discovered kube dns {"commit": "dc3af1a", "kube-dns-ip": "10.100.0.10"}
2024-03-26T08:08:15.835Z    DEBUG   controller.aws  discovered version  {"commit": "dc3af1a", "version": "v0.27.0"}
2024/03/26 08:08:15 Registering 2 clients
2024/03/26 08:08:15 Registering 2 informer factories
2024/03/26 08:08:15 Registering 3 informers
2024/03/26 08:08:15 Registering 6 controllers
2024-03-26T08:08:15.837Z    INFO    controller  Starting server {"commit": "dc3af1a", "path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2024-03-26T08:08:15.837Z    INFO    controller  Starting server {"commit": "dc3af1a", "kind": "health probe", "addr": "[::]:8081"}
I0326 08:08:16.259816       1 leaderelection.go:248] attempting to acquire leader lease internal-system/karpenter-leader-election...
2024-03-26T08:08:16.264Z    INFO    controller  Starting informers...   {"commit": "dc3af1a"}
2024-03-26T08:08:16.591Z    ERROR   webhook.DefaultingWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "fa205b03-8d6a-4b58-91d5-292e0795815b", "knative.dev/key": "internal-system/karpenter-cert", "duration": "77.261µs", "error": "error retrieving webhook: mutatingwebhookconfiguration.admissionregistration.k8s.io \"defaulting.webhook.karpenter.sh\" not found"}
2024-03-26T08:08:16.609Z    ERROR   webhook.DefaultingWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "303a0141-1370-4bb0-80e3-288f88b1203c", "knative.dev/key": "internal-system/karpenter-cert", "duration": "111.351µs", "error": "error retrieving webhook: mutatingwebhookconfiguration.admissionregistration.k8s.io \"defaulting.webhook.karpenter.sh\" not found"}
2024-03-26T08:08:16.682Z    ERROR   webhook.DefaultingWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "500a5306-ffcc-40f6-9fdd-2af36ddf3c7a", "knative.dev/key": "internal-system/karpenter-cert", "duration": "82.431µs", "error": "error retrieving webhook: mutatingwebhookconfiguration.admissionregistration.k8s.io \"defaulting.webhook.karpenter.sh\" not found"}
2024-03-26T08:08:16.702Z    ERROR   webhook.DefaultingWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "b2c597ee-d0a2-4a19-9321-3f44c2f0ffea", "knative.dev/key": "internal-system/karpenter-cert", "duration": "62.511µs", "error": "error retrieving webhook: mutatingwebhookconfiguration.admissionregistration.k8s.io \"defaulting.webhook.karpenter.sh\" not found"}
2024-03-26T08:08:16.774Z    ERROR   webhook.DefaultingWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "a21e716b-2881-4364-8382-37a4e1d52ad6", "knative.dev/key": "internal-system/karpenter-cert", "duration": "81.16µs", "error": "error retrieving webhook: mutatingwebhookconfiguration.admissionregistration.k8s.io \"defaulting.webhook.karpenter.sh\" not found"}
2024-03-26T08:08:16.874Z    ERROR   webhook.DefaultingWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "cfbfd6cd-5045-43ef-aafa-f5c3655f7521", "knative.dev/key": "internal-system/karpenter-cert", "duration": "70.85µs", "error": "error retrieving webhook: mutatingwebhookconfiguration.admissionregistration.k8s.io \"defaulting.webhook.karpenter.sh\" not found"}
2024-03-26T08:08:16.886Z    INFO    controller.aws.pricing  updated spot pricing with instance types and offerings  {"commit": "dc3af1a", "instance-type-count": 631, "offering-count": 1424}
2024-03-26T08:08:16.909Z    ERROR   webhook.DefaultingWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "9827833b-f9d8-4ef1-94e8-9030b1d87b4e", "knative.dev/key": "defaulting.webhook.karpenter.k8s.aws", "duration": "299.030455ms", "error": "failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"defaulting.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}
2024-03-26T08:08:16.909Z    ERROR   webhook.ValidationWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "0cb35cfc-ad01-44ef-ab1e-5e1e3f66a93c", "knative.dev/key": "internal-system/karpenter-cert", "duration": "318.031512ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
2024-03-26T08:08:16.914Z    ERROR   webhook.ValidationWebhook   Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "48cde0f7-fb43-4c9f-9160-2d826fe3c60f", "knative.dev/key": "validation.webhook.karpenter.k8s.aws", "duration": "284.505677ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}
2024-03-26T08:08:16.915Z    ERROR   webhook.ConfigMapWebhook    Reconcile error {"commit": "dc3af1a", "knative.dev/traceid": "f4eb39e9-4857-462b-bd47-69023e5dbb09", "knative.dev/key": "validation.webhook.config.karpenter.sh", "duration": "305.705028ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.config.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
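
The log above mixes two different reconcile errors: "not found" lookups and "the object has been modified" conflicts. A small sketch for telling them apart when triaging a log like this follows; the line layout (timestamp, level, logger, message, trailing JSON) is assumed from the excerpt above, and `classify` and `summarize` are hypothetical helper names, not Karpenter tooling.

```python
import json
import re
from collections import Counter

# timestamp, LEVEL, logger, then the rest (message plus trailing JSON fields).
LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+(?P<rest>.*)$")


def classify(error_text):
    """Bucket an error string by the two phrases seen in the excerpt."""
    if "the object has been modified" in error_text:
        return "conflict"
    if "not found" in error_text:
        return "not-found"
    return "other"


def summarize(lines):
    """Count ERROR lines per (logger, error kind); skip anything unparsable."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match is None or match.group("level") != "ERROR":
            continue
        rest = match.group("rest")
        start = rest.find("{")
        if start < 0:
            continue
        try:
            fields = json.loads(rest[start:])
        except json.JSONDecodeError:
            continue
        counts[(match.group("logger"), classify(fields.get("error", "")))] += 1
    return counts


# Shortened lines in the same shape as the log above.
sample_lines = [
    r'2024-03-26T08:08:16.264Z    INFO    controller  Starting informers...   {"commit": "dc3af1a"}',
    r'2024-03-26T08:08:16.591Z    ERROR   webhook.DefaultingWebhook   Reconcile error {"commit": "dc3af1a", "error": "error retrieving webhook: mutatingwebhookconfiguration.admissionregistration.k8s.io \"defaulting.webhook.karpenter.sh\" not found"}',
    r'2024-03-26T08:08:16.909Z    ERROR   webhook.ValidationWebhook   Reconcile error {"commit": "dc3af1a", "error": "failed to update webhook: the object has been modified; please apply your changes to the latest version and try again"}',
    "2024/03/26 08:08:15 Registering 2 clients",
]
for key, count in sorted(summarize(sample_lines).items()):
    print(key, count)
```

Separating the two kinds matters because the maintainer's earlier comment described only the conflict error as expected at startup; the "not found" lines are a different condition.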