dcherniv opened 1 month ago
Worked around this issue via https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2344833535. It is very painful.
Can you share the NodeClaim that existed prior to the conversion, as well as the config for the CRD and the deployment? This looks like it should have been defaulted by the conversion webhook, so I'm surprised that this converted over without the group being set.
I have more explanation here: https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2344833535. The NodeClaims, while on 1.0.2, were identical; just some of them didn't have the group set for some reason. The NodePool manifest that was deployed at the time:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  labels:
    argocd.argoproj.io/instance: prod-karpenter
  name: default-arm64
spec:
  disruption:
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '36'
  template:
    metadata:
      labels:
        arch: arm64
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - m7g
            - m6g
            - c7g
            - c6g
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '2'
            - '4'
            - '8'
            - '16'
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - arm64
      taints:
        - effect: NoSchedule
          key: arch
          value: arm64
  weight: 50
```
The CRDs were from version 1.0.2 of the karpenter-crd chart, installed into the karpenter namespace.
Relevant EC2NodeClass:
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  labels:
    argocd.argoproj.io/instance: prod-karpenter
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 100Gi
        volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: initial-eks-node-group-20230826183814285900000005
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod
  subnetSelectorTerms:
    - tags:
        Name: '*private*'
        karpenter.sh/discovery: prod
  tags:
    karpenter.sh/discovery: prod
```
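A quick way to spot the affected objects (a sketch, assuming jq is available and the karpenter.sh/v1 field layout) is to list the NodeClaims whose spec.nodeClassRef.group is empty:

```bash
# Sketch: list NodeClaims whose nodeClassRef has no group set (assumes jq is installed)
kubectl get nodeclaims -o json \
  | jq -r '.items[] | select((.spec.nodeClassRef.group // "") == "") | .metadata.name'
```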
> just some of them didn't have the group set for some reason
I suspect that they didn't have the group because you changed versions without enabling the conversion webhooks. This is going to cause fields to get dropped, since the apiserver doesn't know how to hydrate the data.
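For reference, the conversion webhooks are enabled through the karpenter-crd chart; below is a sketch along the lines of the v1 migration guide (the values and namespaces here are assumptions, so adjust them to wherever your Karpenter controller Service actually runs):

```bash
# Sketch based on the v1 migration guide; values/namespaces may differ for your install
helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd --version "1.0.2" \
  --namespace karpenter --create-namespace \
  --set webhook.enabled=true \
  --set webhook.serviceName=karpenter \
  --set webhook.serviceNamespace=kube-system \
  --set webhook.port=8443
```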
Regardless, I suspect that this fix is going to solve your problem, because we will no longer rely on the spec being configured correctly for the Patch to succeed, which should unblock you from this issue.
That said, I wouldn't recommend going through the upgrade path without the conversion webhooks, because you may end up in an undefined state.
Are you willing to try installing a snapshot to validate whether this fixes the issue? Here's the command for installing the snapshot version of Karpenter with the patch fix:
```bash
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com
helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-1f88e3ae893d937267c5c28420597ea5be1d77ef" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
```
We plan to go from v0.37.3 to v1.0.2 with the webhook enabled. Is this issue resolved, or should we wait for a newer patch version in v1.0?
I have the issue on 1.0.2
Yes, updating to 1.0.3 did not remove this error.
Description
Observed Behavior: After upgrading to 1.0.2 following the upgrade path at https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure, the webhook didn't properly convert the NodeClaims; as a result there are a number of claims that are failing:
These NodeClaims cannot be removed by force, because Karpenter 1.0.2 fails to delete them due to the aforementioned error, and they cannot be edited to remove the finalizer because they are immutable.
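For context, "removed by force" and "editing out the finalizer" refer to commands along these lines (a sketch only; "stuck-nodeclaim" is a placeholder name, and per the report above these did not resolve the stuck state):

```bash
# Sketch only; "stuck-nodeclaim" is a placeholder NodeClaim name
kubectl delete nodeclaim stuck-nodeclaim --grace-period=0 --force
# Clearing finalizers via a merge patch is the usual last resort for a stuck object
kubectl patch nodeclaim stuck-nodeclaim --type merge -p '{"metadata":{"finalizers":null}}'
```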
Expected Behavior: Old NodeClaims left over from 0.37.3 are properly cleaned up and replaced by new NodeClaims.
Reproduction Steps (Please include YAML): Follow the upgrade path; see https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2342521512
Versions:
Chart Version: 0.37.3, 1.0.2
Kubernetes Version (kubectl version): 1.29+ (EKS)
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment