Open HakjunMIN opened 4 days ago
We are also experiencing the same issue: the NodeClaim is able to create the node, but the node is not able to join the cluster.
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    karpenter.azure.com/aksnodeclass-hash: "17390774718828605676"
    karpenter.azure.com/aksnodeclass-hash-version: v1
    karpenter.azure.com/in-place-update-hash: "3739780270"
    karpenter.sh/nodepool-hash: "10684441112471014376"
    karpenter.sh/nodepool-hash-version: v3
    karpenter.sh/stored-version-migrated: "true"
  creationTimestamp: "2024-11-20T12:55:10Z"
  finalizers:
  - karpenter.sh/termination
  generateName: default-
  generation: 1
  labels:
    karpenter.azure.com/sku-cpu: "2"
    karpenter.azure.com/sku-encryptionathost-capable: "true"
    karpenter.azure.com/sku-family: D
    karpenter.azure.com/sku-gpu-count: "0"
    karpenter.azure.com/sku-hyperv-generation: "2"
    karpenter.azure.com/sku-memory: "8192"
    karpenter.azure.com/sku-name: Standard_D2ps_v5
    karpenter.azure.com/sku-networking-accelerated: "true"
    karpenter.azure.com/sku-storage-premium-capable: "true"
    karpenter.azure.com/sku-version: "5"
    karpenter.azure.com/zone: eastus-1
    karpenter.sh/capacity-type: spot
    karpenter.sh/nodepool: default
    kubernetes.io/arch: arm64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: Standard_D2ps_v5
    topology.kubernetes.io/region: eastus
  name: default-j77ws
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: default
    uid: bb2aa657-070b-4802-b58c-937650391842
  resourceVersion: "15245540"
  uid: b2f1920c-1450-42a4-bfab-a9383b50664f
spec:
  expireAfter: 720h
  nodeClassRef:
    group: karpenter.azure.com
    kind: AKSNodeClass
    name: default
  requirements:
  - key: karpenter.azure.com/sku-name
    operator: In
    values:
    - Standard_D2ps_v5
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
    - spot
  - key: kubernetes.io/arch
    operator: In
    values:
    - arm64
  - key: karpenter.azure.com/sku-family
    operator: In
    values:
    - D
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - Standard_D2ps_v5
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - eastus-1
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - default
  resources:
    requests:
      cpu: 1054m
      memory: 1878Mi
      pods: "14"
status:
  allocatable:
    cpu: 1900m
    ephemeral-storage: 128G
    memory: "5226731929"
    pods: "110"
  capacity:
    cpu: "2"
    ephemeral-storage: 128G
    memory: "7945689497"
    pods: "110"
  conditions:
  - lastTransitionTime: "2024-11-20T12:56:33Z"
    message: Node not registered with cluster
    reason: NodeNotFound
    status: Unknown
    type: Initialized
  - lastTransitionTime: "2024-11-20T12:56:33Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-11-20T12:56:33Z"
    message: Initialized=Unknown, Registered=Unknown
    reason: UnhealthyDependents
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-11-20T12:56:33Z"
    message: Node not registered with cluster
    reason: NodeNotFound
    status: Unknown
    type: Registered
  imageID: /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2arm64containerd/versions/202411.12.0
  providerID: azure:///subscriptions/xxxxxxxxxxxxx/resourceGroups/staging-eastus-aks-k1-nodes/providers/Microsoft.Compute/virtualMachines/aks-default-j77ws
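For anyone checking several NodeClaims at once, the status conditions in this NodeClaim can be scanned programmatically. This is only a rough sketch in Python: it assumes the NodeClaim has been exported as JSON (for example with `kubectl get nodeclaim default-j77ws -o json`), and the helper name `unhealthy_conditions` is made up for illustration. The embedded JSON is a trimmed copy of the four conditions reported here.

```python
import json

# Trimmed copy of status.conditions from the NodeClaim in this comment,
# in the JSON shape that `kubectl get nodeclaim <name> -o json` returns.
NODECLAIM_JSON = """
{
  "status": {
    "conditions": [
      {"type": "Initialized", "status": "Unknown", "reason": "NodeNotFound",
       "message": "Node not registered with cluster"},
      {"type": "Launched", "status": "True", "reason": "Launched", "message": ""},
      {"type": "Ready", "status": "Unknown", "reason": "UnhealthyDependents",
       "message": "Initialized=Unknown, Registered=Unknown"},
      {"type": "Registered", "status": "Unknown", "reason": "NodeNotFound",
       "message": "Node not registered with cluster"}
    ]
  }
}
"""


def unhealthy_conditions(nodeclaim):
    """Return (type, status, reason) for every condition whose status is not "True".

    Hypothetical helper for illustration; it is not part of Karpenter.
    """
    conditions = nodeclaim.get("status", {}).get("conditions", [])
    return [
        (c["type"], c["status"], c.get("reason", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


nodeclaim = json.loads(NODECLAIM_JSON)
for ctype, status, reason in unhealthy_conditions(nodeclaim):
    print(f"{ctype}: {status} ({reason})")
```

Only `Launched` is `True` here, so the VM was created, while `Registered` and `Initialized` stay `Unknown` because the node never appears in the cluster.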
I logged on to one of the failing nodes and saw the following errors in the kubelet logs:
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.481871 5465 csi_plugin.go:892] Failed to contact API server when waiting for CSINode publishing: Unauthorized
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.689912 5465 kubelet_node_status.go:356] "Setting node annotation to enable volume controller attach/detach"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.690179 5465 kubelet_node_status.go:373] "the node label will overwrite default setting" labelKey="kubernetes.io/arch" labelValue="arm64" default="arm64"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.690312 5465 kubelet_node_status.go:373] "the node label will overwrite default setting" labelKey="kubernetes.io/os" labelValue="linux" default="linux"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691144 5465 kubelet_node_status.go:679] "Recording event message for node" node="aks-default-6jmgj" event="NodeHasSufficientMemory"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691176 5465 kubelet_node_status.go:679] "Recording event message for node" node="aks-default-6jmgj" event="NodeHasNoDiskPressure"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691187 5465 kubelet_node_status.go:679] "Recording event message for node" node="aks-default-6jmgj" event="NodeHasSufficientPID"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691209 5465 kubelet_node_status.go:73] "Attempting to register node" node="aks-default-6jmgj"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: E1120 07:59:11.711941 5465 controller.go:145] "Failed to ensure lease exists, will retry" err="Unauthorized" interval="7s"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: E1120 07:59:11.712268 5465 kubelet_node_status.go:96] "Unable to register node with API server" err="Unauthorized" node="aks-default-6jmgj"
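To pull just the failing API calls out of a longer kubelet log, something like the following could be used. It is a minimal sketch, assuming the structured `"message" err="..."` line shape seen in the excerpt; the regular expression and the `failed_calls` helper are illustrative and not part of any kubelet tooling. Unstructured lines such as the `csi_plugin.go` one are deliberately not matched.

```python
import re

# Four lines copied from the kubelet log excerpt in this comment.
LOG = '''Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.481871 5465 csi_plugin.go:892] Failed to contact API server when waiting for CSINode publishing: Unauthorized
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691209 5465 kubelet_node_status.go:73] "Attempting to register node" node="aks-default-6jmgj"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: E1120 07:59:11.711941 5465 controller.go:145] "Failed to ensure lease exists, will retry" err="Unauthorized" interval="7s"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: E1120 07:59:11.712268 5465 kubelet_node_status.go:96] "Unable to register node with API server" err="Unauthorized" node="aks-default-6jmgj"
'''

# Matches: <file>.go:<line>] "<message>" err="<error>"
ERR_RE = re.compile(
    r'(?P<source>\w+\.go:\d+)\] "(?P<msg>[^"]*)" err="(?P<err>[^"]*)"'
)


def failed_calls(log_text):
    """Return (source, message, error) for each structured line carrying err="..."."""
    results = []
    for line in log_text.splitlines():
        match = ERR_RE.search(line)
        if match:
            results.append(
                (match.group("source"), match.group("msg"), match.group("err"))
            )
    return results


for source, msg, err in failed_calls(LOG):
    print(f"{source}: {msg} -> {err}")
```

Both matched lines report `Unauthorized`, which suggests the API server is rejecting the kubelet's credentials rather than the node having a networking problem.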
Version
Self-hosted Karpenter 0.7.0, using a custom build that supports H100 (https://github.com/Azure/karpenter-provider-azure/pull/553)
Expected Behavior
The node should join the cluster. Even after
az aks update with empty
Actual Behavior
The NodeClaim is repeatedly created, but the node fails to join the cluster.
Steps to Reproduce the Problem
Generated a NodeClass and NodePool for A100. The image version 2204gen2containerd/versions/202411.12.0 in the NodeClass looks suspicious.
Resource Specs and Logs
nodeclaim status
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    karpenter.azure.com/aksnodeclass-hash: "17390774718828605676"
    karpenter.azure.com/aksnodeclass-hash-version: v1
    karpenter.azure.com/in-place-update-hash: "383245467"
    karpenter.sh/nodepool-hash: "1090957544758137294"
    karpenter.sh/nodepool-hash-version: v3
    karpenter.sh/stored-version-migrated: "true"
  creationTimestamp: "2024-11-18T06:58:23Z"
  finalizers:
karpenter log