Open dcherniv opened 6 days ago
Can you share if there is anything in the logs that indicates why the webhook rejected the request? I'm also curious if this even made it to the webhook or if something is getting in the way of the network traffic to the service/pod.
@jonathan-innis nothing beyond this:
{"level":"ERROR","time":"2024-09-11T20:30:53.969Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","namespace":"istio-system"},"namespace":"istio-system","name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","reconcileID":"86d25014-bd4b-4b5d-852f-a5c4d50fe7c4","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}
This is from the Karpenter logs. I'm assuming this was trying to validate a nodeclaim which didn't have the group field set. From reading the docs for the migration to v1: can you confirm that 0.37.3 doesn't mutate resources? It's worth noting that at the point I got this error, all the resources I have defined for node-pools etc. in my chart were already upgraded to the v1 spec and deployed prior to deploying the Karpenter 1.0.2 chart. My path was:
Upgrade all manifests to v1 spec on 0.37.3
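For anyone scanning their own controller logs for the same failure, here is a minimal Python sketch that pulls the webhook name out of a structured log line like the one quoted above. The log line is trimmed to the fields used here, and the function name `failing_webhook` is hypothetical, not part of Karpenter.

```python
import json
import re
from typing import Optional

# Trimmed copy of the controller log line quoted above (only the fields used here).
LOG_LINE = r'{"level":"ERROR","message":"Reconciler error","controller":"nodeclaim.podevents","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}'


def failing_webhook(line: str) -> Optional[str]:
    """Return the webhook name from a 'failed calling webhook' error, or None."""
    record = json.loads(line)
    match = re.search(r'failed calling webhook "([^"]+)"', record.get("error", ""))
    return match.group(1) if match else None


print(failing_webhook(LOG_LINE))
```

Feeding each line of the controller log through this function gives a quick list of which webhooks are being rejected, which helps tell a leftover webhook configuration apart from a general network problem.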
When you run through this part of your upgrade, did this also have the v1 storage version? Even with the resources on the v1 spec, if they aren't stored on the correct storage version, there can be issues during the upgrade.
Also, from what it looks like, there is something on the network path that is blocking this traffic from getting through. If there are no errors on the webhook side, that would indicate to me that something is preventing the call from the apiserver from reaching the pod service endpoint.
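One way to answer the storage-version question is to look at `status.storedVersions` on each Karpenter CRD. The sketch below is a rough Python helper under a few assumptions: `kubectl` is on the PATH and pointed at the right cluster, and the CRD names in `KARPENTER_CRDS` match what `kubectl get crd` shows in your cluster. The function names are hypothetical. Note that `storedVersions` records versions that have been persisted at some point, so an old entry only shows that objects may still be stored in that version.

```python
import json
import subprocess
from typing import List

# Assumed CRD names for Karpenter on AWS; adjust to what `kubectl get crd` lists.
KARPENTER_CRDS = [
    "nodepools.karpenter.sh",
    "nodeclaims.karpenter.sh",
    "ec2nodeclasses.karpenter.k8s.aws",
]


def stale_stored_versions(crd: dict, wanted: str = "v1") -> List[str]:
    """Return stored versions other than `wanted` recorded in the CRD status."""
    stored = crd.get("status", {}).get("storedVersions", [])
    return [version for version in stored if version != wanted]


def fetch_crd(name: str) -> dict:
    """Fetch one CRD as a dict via kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "crd", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def report() -> None:
    """Print, per CRD, any stored versions that are not v1."""
    for name in KARPENTER_CRDS:
        stale = stale_stored_versions(fetch_crd(name))
        print(name, stale if stale else "only v1 stored")
```

Calling `report()` against the cluster before installing the 1.0.2 chart would show whether any CRD still lists a pre-v1 stored version.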
@jonathan-innis Found a couple of folks who encountered the same issue: https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2318901834 and https://github.com/aws/karpenter-provider-aws/issues/6879#issuecomment-2315685456. Deleting the validating webhooks altogether resolves this. I think they are carried over from 0.37.3 and are not needed on 1.0.2? Can you confirm? It's weird that they are not cleaned up automatically on upgrade.
Agreed, this looks like an issue with the interaction that Karpenter has with Argo. We definitely need to look at this since we shouldn't be leaving behind MutatingWebhookConfigurations and ValidatingWebhookConfigurations after the upgrade.
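Before deleting anything, it may help to see which webhook configurations are actually left behind. The following Python sketch only lists candidates and deletes nothing. It assumes `kubectl` is available, and it assumes leftover configurations can be recognized by the substring "karpenter" in their names, which is a guess based on the webhook name in the error above. The function names are hypothetical.

```python
import json
import subprocess
from typing import List

# Both kinds of webhook configuration mentioned in the comment above.
WEBHOOK_KINDS = ["validatingwebhookconfigurations", "mutatingwebhookconfigurations"]


def karpenter_webhook_configs(listing: dict, needle: str = "karpenter") -> List[str]:
    """Return names of items in a `kubectl get ... -o json` list that contain `needle`."""
    names = [item["metadata"]["name"] for item in listing.get("items", [])]
    return [name for name in names if needle in name]


def fetch_listing(kind: str) -> dict:
    """Fetch all objects of `kind` as a dict via kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", kind, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def report() -> None:
    """Print Karpenter-looking webhook configurations for each kind."""
    for kind in WEBHOOK_KINDS:
        print(kind, karpenter_webhook_configs(fetch_listing(kind)))
```

Running `report()` after the upgrade and comparing the output against what the 1.0.2 chart is expected to install would show which configurations are leftovers from 0.37.3.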
Description
Observed Behavior: karpenter Chart version: 1.0.2 karpenter-crd Chart version: 1.0.2 with webhooks enabled in values as follows:
The webhooks fail with the below error with no indication as to why:
Expected Behavior: Webhook to work in 1.0.2
Reproduction Steps (Please include YAML): https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2344630870
Versions:
Chart Version: 1.0.2
Kubernetes Version (kubectl version): 1.29+ eks
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment