aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

no service port 8443 found for service "karpenter" after migrating from cluster autoscaler #6544

Open · TimSin opened this issue 1 month ago

TimSin commented 1 month ago

Description

Observed Behavior:

After following the instructions at https://karpenter.sh/preview/getting-started/migrating-from-cas/, when I try to create the NodePool (as outlined in the guide) I receive the error:

Error from server: error when creating "nodepool.yaml": conversion webhook for karpenter.sh/v1beta1, Kind=NodePool failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": no service port 8443 found for service "karpenter"
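
The error is the API server saying the karpenter Service has no port 8443 for the CRD conversion webhook to call. A quick way to confirm what the Service actually exposes (assuming the kube-system defaults from the guide):

> kubectl get service karpenter -n kube-system -o jsonpath='{.spec.ports}'

If 8443 isn't listed there, the installed CRDs are expecting a conversion webhook that the chart never exposed.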

Expected Behavior:

A working karpenter installation

Reproduction Steps (Please include YAML):

I followed the instructions at https://karpenter.sh/preview/getting-started/migrating-from-cas/ on an existing EKS cluster.

Versions:

rwe-dtroup commented 1 month ago

We have the exact same error. We can't even remove Karpenter at this point.

engedaam commented 1 month ago

The preview section of the Karpenter docs covers the upcoming, pre-release version of Karpenter. I suggest you use the latest released version, which is v0.37.0: https://karpenter.sh/v0.37. Specifically: https://karpenter.sh/v0.37/getting-started/migrating-from-cas/
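
For reference, the install step from that guide looks roughly like the sketch below (the version, cluster name, and controller role ARN are placeholders to substitute with your own values):

> helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "0.37.0" \
    --namespace kube-system --create-namespace \
    --set "settings.clusterName=${CLUSTER_NAME}" \
    --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="${KARPENTER_CONTROLLER_ROLE_ARN}" \
    --wait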

fnmarquez commented 1 month ago

Same problem here.

engedaam commented 1 month ago

@fnmarquez Which page are you using?

NicholasRaymondiSpot commented 1 month ago

EDIT 7/26: We were able to resolve this today; it turned out the main-branch CRDs had been applied to the cluster instead of the 0.37.0-tagged CRDs. That caused a lot of confusion about requirements, but once we diffed what was running in the cluster against the tagged release we were able to track it down based on the v1 configurations.

We hit the same issue when upgrading our existing EKS configurations; there was no previous cluster-autoscaler configuration in use. With the 0.37 Helm chart we have webhook.enabled: false from the default values. These errors only started showing up in our cluster after upgrading from 0.36.2 to 0.37 and applying the latest CRDs.

>  kubectl get nodeclaim -A
Error from server: conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": no service port 8443 found for service "karpenter"
> kubectl logs deployment/karpenter -n kube-system
{"level":"ERROR","time":"2024-07-24T23:20:37.998Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"provisioner","error":"creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\"; creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\"","errorCauses":[{"error":"creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\""},{"error":"creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\""}]}
{"level":"ERROR","time":"2024-07-24T23:20:42.036Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-1-2-3-4.ec2.internal"},"namespace":"","name":"ip-1-2-3-4.ec2.internal","reconcileID":"<ID>","error":"deleting nodeclaims, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\""}

Once I flipped the webhook value to true and added the port configuration to our deployment, we started getting different errors for signed cert trust:

> kubectl get nodeclaim -A
Error from server: conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
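
For context, the "flipped the webhook value" change above was just chart values, roughly like this (assuming the 0.37 chart's webhook.enabled / webhook.port values; adjust the release name and namespace to your install):

> helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter --version "0.37.0" \
    --namespace kube-system --reuse-values \
    --set webhook.enabled=true --set webhook.port=8443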

Now I'm starting to go down the rabbit-hole of generating a self-signed cert and updating the caBundle base64-encoded PEM cert value in our validation.webhook.karpenter.k8s.aws ValidatingWebhookConfiguration to see if this will get us working again. So far I'm not confident that this is the right approach for resolving this issue but it's put a halt on our Karpenter upgrades until we can find a proper path forward.
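
For anyone else landing here: per the EDIT above, the check that actually unblocked us was comparing the CRDs in the cluster against the 0.37.0 tag and re-applying the tagged ones. A rough sketch (the raw URLs are assumed from the repo/docs layout, so double-check them against the release you're running):

> kubectl diff -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/v0.37.0/pkg/apis/crds/karpenter.sh_nodepools.yaml
> kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/v0.37.0/pkg/apis/crds/karpenter.sh_nodepools.yaml
> kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/v0.37.0/pkg/apis/crds/karpenter.sh_nodeclaims.yaml
> kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/v0.37.0/pkg/apis/crds/karpenter.k8s.aws_ec2nodeclasses.yaml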

code-crusher commented 1 month ago

We are also facing the same issue. Karpenter v0.37.0 was running fine on an EKS 1.29 cluster. After upgrading the cluster to 1.30, this issue started to occur. We were not using webhooks before either.

> kubectl get service -n kube-system karpenter
NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
karpenter   ClusterIP   172.20.116.212   <none>        8000/TCP   43s
> kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter
{"level":"ERROR","time":"2024-08-05T06:40:22.229Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"XXXX-ec2nodeclass"},"namespace":"","name":"XXX-ec2nodeclass","reconcileID":"81bd9baf-5b2d-4f26-af7b-XXXXX","error":"conversion webhook for karpenter.k8s.aws/v1beta1, Kind=EC2NodeClass failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\""}

Checks done:

  1. No webhooks related to Karpenter exist in validatingwebhookconfigurations or mutatingwebhookconfigurations
  2. Pods and Service are running.

Update: It's working now. I purged everything related to Karpenter and re-installed, and it just worked. Still unsure why, but I did notice the CRDs were never deleted on the first purge attempt -- I had to delete them manually by removing the finalizer.
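
For the CRDs that wouldn't delete, clearing the finalizers first was what did it, something like the sketch below (this skips Karpenter's own cleanup, so only do it when you are intentionally purging everything):

> kubectl patch crd nodepools.karpenter.sh --type merge -p '{"metadata":{"finalizers":[]}}'
> kubectl patch crd nodeclaims.karpenter.sh --type merge -p '{"metadata":{"finalizers":[]}}'
> kubectl patch crd ec2nodeclasses.karpenter.k8s.aws --type merge -p '{"metadata":{"finalizers":[]}}'
> kubectl delete crd nodepools.karpenter.sh nodeclaims.karpenter.sh ec2nodeclasses.karpenter.k8s.aws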

Vinaum8 commented 3 weeks ago

I installed karpenter in a separate namespace and with a different service name than the default.

I can't get the values for the webhook configured, and it returns the error:

service: karpenter not found.

That is expected, of course, since I did not create a service with that name in the kube-system namespace.

I am using version 0.37.1 and I installed karpenter with ArgoCD.

rcalosso commented 2 weeks ago

We ran into similar problems and solved them by not using the static CRDs, which have the webhook settings hardcoded. We switched to the karpenter-crd Helm chart and updated the webhook values.

See: https://karpenter.sh/preview/upgrading/upgrade-guide/#crd-upgrades
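
Roughly what that ends up looking like (a sketch; the webhook.* values are the ones shown in that guide, pointed at wherever your karpenter Service actually runs, and they may require a chart version that templates them):

> helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd --version "${KARPENTER_VERSION}" \
    --namespace kube-system --create-namespace \
    --set webhook.enabled=true \
    --set webhook.serviceName=karpenter \
    --set webhook.serviceNamespace=kube-system \
    --set webhook.port=8443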

Vinaum8 commented 2 weeks ago

#6818

akramincity commented 1 day ago

The simplest way to upgrade Karpenter is to delete all validating and mutating webhook configurations, since Karpenter v0.37.0+ does not use any webhooks:

> kubectl delete validatingwebhookconfiguration validation.webhook.config.karpenter.sh validation.webhook.karpenter.sh
> kubectl delete mutatingwebhookconfigurations defaulting.webhook.karpenter.k8s.aws

rwe-dtroup commented 1 day ago

> The simplest way to upgrade Karpenter is to delete all validating and mutating webhook configurations, since Karpenter v0.37.0+ does not use any webhooks: kubectl delete validatingwebhookconfiguration validation.webhook.config.karpenter.sh validation.webhook.karpenter.sh; kubectl delete mutatingwebhookconfigurations defaulting.webhook.karpenter.k8s.aws

Not the easiest of upgrades if you're doing this through pipelines, though. Having the CRDs updated/replaced seems to be the better option right now.

The issue we faced was that, having tried to update to the newer version, we could no longer do anything because the CRDs were looking for the conversion webhook. It looks like an issue with the server's listener, when in fact it has nothing to do with that, and you can end up down a rabbit hole.
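
If you want to confirm it's the CRDs rather than the controller's listener, the conversion stanza is visible directly on the CRD, e.g.:

> kubectl get crd nodepools.karpenter.sh -o jsonpath='{.spec.conversion}'

If that shows strategy Webhook with a clientConfig pointing at karpenter:8443, every v1beta1 read/write goes through that service, regardless of what the controller itself is listening on.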