aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter conversion webhooks do not work on chart version 1.0.2 #6982

Open dcherniv opened 6 days ago

dcherniv commented 6 days ago

Description

Observed Behavior: Running karpenter chart version 1.0.2 and karpenter-crd chart version 1.0.2, with webhooks enabled in values as follows:

karpenter-crd:
  webhook:
    enabled: false
karpenter:
  enabled: true
  webhook:
    enabled: false

The webhooks fail with the below error with no indication as to why:

{"level":"ERROR","time":"2024-09-11T20:30:53.969Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","namespace":"istio-system"},"namespace":"istio-system","name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","reconcileID":"86d25014-bd4b-4b5d-852f-a5c4d50fe7c4","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}
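Since the log line is dense, a small parser can pull out the fields that matter for triage. This is a minimal sketch, assuming the controller emits one JSON object per line as shown above; the function name `summarize_webhook_error` and the abbreviated `LOG_LINE` are illustrative, not part of Karpenter.

```python
import json
import re

# Abbreviated copy of the log line above (same values, fewer fields).
LOG_LINE = r'''{"level":"ERROR","logger":"controller","message":"Reconciler error","controller":"nodeclaim.podevents","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}'''


def summarize_webhook_error(line: str) -> dict:
    """Return the controller, webhook name, and reason from one JSON log line."""
    record = json.loads(line)
    error = record.get("error", "")
    # The API server reports the failing webhook by its quoted name.
    match = re.search(r'failed calling webhook "([^"]+)": (.*)', error)
    return {
        "controller": record.get("controller"),
        "webhook": match.group(1) if match else None,
        "reason": match.group(2) if match else error,
    }


if __name__ == "__main__":
    print(summarize_webhook_error(LOG_LINE))
```

Here the extracted webhook name is `validation.webhook.karpenter.sh`, which is the name to look for among the cluster's ValidatingWebhookConfigurations.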

Expected Behavior: Webhook to work in 1.0.2

Reproduction Steps (Please include YAML): https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2344630870

Versions:

jonathan-innis commented 6 days ago

Can you share if there is anything in the logs that indicates why the webhook rejected the request? I'm also curious if this even made it to the webhook or if something is getting in the way of the network traffic to the service/pod.

dcherniv commented 6 days ago

@jonathan-innis nothing beyond this:

{"level":"ERROR","time":"2024-09-11T20:30:53.969Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","namespace":"istio-system"},"namespace":"istio-system","name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","reconcileID":"86d25014-bd4b-4b5d-852f-a5c4d50fe7c4","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}

This is from the Karpenter logs. I'm assuming it was trying to validate a nodeclaim that didn't have the group field set. I've been reading the docs for the migration to v1. Can you confirm that 0.37.3 doesn't mutate resources?

It's worth noting that by the time I got this error, all the resources I define for node pools etc. in my chart had already been upgraded to the v1 spec and deployed, before I deployed the Karpenter 1.0.2 chart. My path was:

  1. Upgrade to 0.37.3
  2. Upgrade all manifests to v1 spec on 0.37.3
  3. Deploy Karpenter 1.0.2

jonathan-innis commented 6 days ago

> Upgrade all manifests to v1 spec on 0.37.3

When you run through this part of your upgrade, did this also have the v1 storage version? Even with the resources on the v1 spec, if they aren't stored on the correct storage version, there can be issues during the upgrade.
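One way to check this is to look at `status.storedVersions` on each Karpenter CRD. The sketch below assumes the input is the JSON printed by `kubectl get crd nodepools.karpenter.sh -o json`; the helper name `stale_stored_versions` and the sample document are hypothetical and trimmed to the fields the function reads.

```python
import json


def stale_stored_versions(crd_json: str, target: str = "v1") -> list[str]:
    """Return stored versions other than `target` for one CRD JSON document."""
    crd = json.loads(crd_json)
    stored = crd.get("status", {}).get("storedVersions", [])
    return [version for version in stored if version != target]


# Hypothetical sample: a CRD that still records v1beta1 as a stored version.
SAMPLE = json.dumps({
    "metadata": {"name": "nodepools.karpenter.sh"},
    "status": {"storedVersions": ["v1beta1", "v1"]},
})

if __name__ == "__main__":
    print(stale_stored_versions(SAMPLE))  # prints ['v1beta1']
```

A non-empty result suggests that some objects may still be persisted at an older version, so the conversion path is still in play for them. An empty result is only a hint, not proof that every object has been rewritten.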

jonathan-innis commented 6 days ago

Also, from what it looks like, there is something on the network path that is blocking this traffic from getting through. If there are no errors on the webhook side, that would indicate to me that something is preventing the call from the apiserver from reaching the pod service endpoint.

dcherniv commented 6 days ago

@jonathan-innis Found a couple of folks who encountered the same issue: https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2318901834 and https://github.com/aws/karpenter-provider-aws/issues/6879#issuecomment-2315685456

Deleting the validating webhooks altogether resolves this. I think they are carried over from 0.37.3 and are not needed on 1.0.2? Can you confirm? It's weird that they are not cleaned up automatically on upgrade.
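Before deleting anything, it can help to list which configurations actually register Karpenter webhooks. This is a rough sketch, assuming the input is the JSON printed by `kubectl get validatingwebhookconfigurations -o json`. The `.karpenter.sh` suffix filter is an assumption based on the webhook name in the error above, and the `istio-validator` entry in the sample is made up for contrast.

```python
import json


def karpenter_webhook_configs(list_json: str, suffix: str = ".karpenter.sh") -> list[str]:
    """Return names of configurations that register a webhook ending in `suffix`."""
    items = json.loads(list_json).get("items", [])
    names = []
    for item in items:
        hooks = item.get("webhooks") or []
        if any(hook.get("name", "").endswith(suffix) for hook in hooks):
            names.append(item["metadata"]["name"])
    return names


# Hypothetical sample list, trimmed to the fields the function reads.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "validation.webhook.karpenter.sh"},
         "webhooks": [{"name": "validation.webhook.karpenter.sh"}]},
        {"metadata": {"name": "istio-validator"},
         "webhooks": [{"name": "validation.istio.io"}]},
    ]
})

if __name__ == "__main__":
    for name in karpenter_webhook_configs(SAMPLE):
        print(name)
```

The same function works on the output of `kubectl get mutatingwebhookconfigurations -o json`, since both list kinds share the `items[].webhooks[].name` layout.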

jonathan-innis commented 1 day ago

Agreed, this looks like an issue with the interaction that Karpenter has with Argo. We definitely need to look at this since we shouldn't be leaving behind MutatingWebhookConfigurations and ValidatingWebhookConfigurations after the upgrade.