Azure / karpenter-provider-azure

AKS Karpenter Provider
Apache License 2.0

[BUG] A100 nodes failed to join the cluster #579

Open HakjunMIN opened 4 days ago

HakjunMIN commented 4 days ago

Version

Self-hosted Karpenter 0.7.0, using a custom build that adds H100 support (https://github.com/Azure/karpenter-provider-azure/pull/553)

Expected Behavior

The node should join the cluster, even after az aks update with empty

Actual Behavior

A NodeClaim is created repeatedly, but the resulting node fails to join the cluster each time.

Steps to Reproduce the Problem

Created a NodeClass and a NodePool for A100. I suspect the image version 2204gen2containerd/versions/202411.12.0 resolved for the NodeClass.

Resource Specs and Logs


{"level":"info","ts":1731765405.9152334,"logger":"fallback.pricing","caller":"pricing/pricing.go:201","msg":"updated on-demand pricing for region koreacentral","instance-type-count":802}
{"level":"info","ts":1731808607.4294405,"logger":"fallback.pricing","caller":"pricing/pricing.go:201","msg":"updated on-demand pricing for region koreacentral","instance-type-count":802}
{"level":"info","ts":1731851808.9626968,"logger":"fallback.pricing","caller":"pricing/pricing.go:267","msg":"updated spot pricing for region koreacentral","instance-type-count":754}
{"level":"info","ts":1731851808.9630613,"logger":"fallback.pricing","caller":"pricing/pricing.go:201","msg":"updated on-demand pricing for region koreacentral","instance-type-count":802}
{"level":"info","ts":1731895010.7737744,"logger":"fallback.pricing","caller":"pricing/pricing.go:201","msg":"updated on-demand pricing for region koreacentral","instance-type-count":802}
{"level":"INFO","time":"2024-11-18T06:58:23.982Z","logger":"controller","message":"found provisionable pod(s)","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"7a0fe0dc-6a75-4e7e-8b9d-79a8949da88d","Pods":"eui-heo/a100-20cpu-200gi-0","duration":"17.285189ms"}
{"level":"INFO","time":"2024-11-18T06:58:23.982Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"7a0fe0dc-6a75-4e7e-8b9d-79a8949da88d","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-11-18T06:58:24.000Z","logger":"controller","message":"created nodeclaim","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"7a0fe0dc-6a75-4e7e-8b9d-79a8949da88d","NodePool":{"name":"a100"},"NodeClaim":{"name":"a100-7zl52"},"requests":{"cpu":"20570m","memory":"205738Mi","nvidia.com/gpu":"1","pods":"10"},"instance-types":"Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4"}
{"level":"info","ts":1731913105.0520005,"logger":"fallback","caller":"instance/instance.go:611","msg":"Selected instance type Standard_NC48ads_A100_v4"}
{"level":"info","ts":1731913105.0539694,"logger":"fallback","caller":"imagefamily/resolver.go:87","msg":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202411.12.0 for instance type Standard_NC48ads_A100_v4"}
{"level":"info","ts":1731913105.0542386,"logger":"fallback","caller":"loadbalancer/loadbalancer.go:130","msg":"Querying load balancers in resource group MC_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral"}
{"level":"info","ts":1731913105.1441329,"logger":"fallback","caller":"loadbalancer/loadbalancer.go:145","msg":"Found 2 load balancers of interest"}
{"level":"info","ts":1731913217.833931,"logger":"fallback","caller":"instance/instance.go:152","msg":"launched new instance","launched-instance":"/subscriptions/65dab16c-f619-4556-99b3-86a84c522eb4/resourceGroups/MC_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral/providers/Microsoft.Compute/virtualMachines/aks-a100-7zl52","hostname":"aks-a100-7zl52","type":"Standard_NC48ads_A100_v4","zone":"1","capacity-type":"on-demand"}
{"level":"INFO","time":"2024-11-18T07:00:17.834Z","logger":"controller","message":"launched nodeclaim","commit":"ee194ab","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"a100-7zl52"},"namespace":"","name":"a100-7zl52","reconcileID":"07a79b5f-0603-45d8-8436-03e6d6a7dc8a","provider-id":"azure:///subscriptions/65dab16c-f619-4556-99b3-86a84c522eb4/resourceGroups/mc_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral/providers/Microsoft.Compute/virtualMachines/aks-a100-7zl52","instance-type":"Standard_NC48ads_A100_v4","zone":"koreacentral-1","capacity-type":"on-demand","allocatable":{"cpu":"47420m","ephemeral-storage":"128G","memory":"400085Mi","nvidia.com/gpu":"2","pods":"110"}}
{"level":"INFO","time":"2024-11-18T07:15:24.055Z","logger":"controller","message":"found provisionable pod(s)","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"f55ed39a-6f10-4628-a136-bc8a4d7b78fd","Pods":"eui-heo/a100-20cpu-200gi-0","duration":"16.895173ms"}
{"level":"INFO","time":"2024-11-18T07:15:24.055Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"f55ed39a-6f10-4628-a136-bc8a4d7b78fd","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-11-18T07:15:24.072Z","logger":"controller","message":"created nodeclaim","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"f55ed39a-6f10-4628-a136-bc8a4d7b78fd","NodePool":{"name":"a100"},"NodeClaim":{"name":"a100-vmdnp"},"requests":{"cpu":"20570m","memory":"205738Mi","nvidia.com/gpu":"1","pods":"10"},"instance-types":"Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4"}
{"level":"info","ts":1731914125.1174219,"logger":"fallback","caller":"instance/instance.go:611","msg":"Selected instance type Standard_NC48ads_A100_v4"}
{"level":"info","ts":1731914125.1191142,"logger":"fallback","caller":"imagefamily/resolver.go:87","msg":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202411.12.0 for instance type Standard_NC48ads_A100_v4"}
{"level":"INFO","time":"2024-11-18T07:16:04.437Z","logger":"controller","message":"deleted nodeclaim","commit":"ee194ab","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"a100-7zl52"},"namespace":"","name":"a100-7zl52","reconcileID":"00821b71-f356-4e55-9d3c-2be450324735","Node":{"name":""},"provider-id":"azure:///subscriptions/65dab16c-f619-4556-99b3-86a84c522eb4/resourceGroups/mc_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral/providers/Microsoft.Compute/virtualMachines/aks-a100-7zl52"}
{"level":"info","ts":1731914237.3855522,"logger":"fallback","caller":"instance/instance.go:152","msg":"launched new instance","launched-instance":"/subscriptions/65dab16c-f619-4556-99b3-86a84c522eb4/resourceGroups/MC_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral/providers/Microsoft.Compute/virtualMachines/aks-a100-vmdnp","hostname":"aks-a100-vmdnp","type":"Standard_NC48ads_A100_v4","zone":"1","capacity-type":"on-demand"}
{"level":"INFO","time":"2024-11-18T07:17:17.385Z","logger":"controller","message":"launched nodeclaim","commit":"ee194ab","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"a100-vmdnp"},"namespace":"","name":"a100-vmdnp","reconcileID":"b0bb4f02-6988-4878-ba10-315e92b2e846","provider-id":"azure:///subscriptions/65dab16c-f619-4556-99b3-86a84c522eb4/resourceGroups/mc_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral/providers/Microsoft.Compute/virtualMachines/aks-a100-vmdnp","instance-type":"Standard_NC48ads_A100_v4","zone":"koreacentral-1","capacity-type":"on-demand","allocatable":{"cpu":"47420m","ephemeral-storage":"128G","memory":"400085Mi","nvidia.com/gpu":"2","pods":"110"}}
{"level":"INFO","time":"2024-11-18T07:32:24.127Z","logger":"controller","message":"found provisionable pod(s)","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"159cbf63-b1bc-4609-89e7-cdc299ffb13b","Pods":"eui-heo/a100-20cpu-200gi-0","duration":"17.161468ms"}
{"level":"INFO","time":"2024-11-18T07:32:24.127Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"159cbf63-b1bc-4609-89e7-cdc299ffb13b","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-11-18T07:32:24.144Z","logger":"controller","message":"created nodeclaim","commit":"ee194ab","controller":"provisioner","namespace":"","name":"","reconcileID":"159cbf63-b1bc-4609-89e7-cdc299ffb13b","NodePool":{"name":"a100"},"NodeClaim":{"name":"a100-dktb8"},"requests":{"cpu":"20570m","memory":"205738Mi","nvidia.com/gpu":"1","pods":"10"},"instance-types":"Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4"}
{"level":"info","ts":1731915145.189231,"logger":"fallback","caller":"instance/instance.go:611","msg":"Selected instance type Standard_NC48ads_A100_v4"}
{"level":"info","ts":1731915145.1910028,"logger":"fallback","caller":"imagefamily/resolver.go:87","msg":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202411.12.0 for instance type Standard_NC48ads_A100_v4"}
{"level":"INFO","time":"2024-11-18T07:33:03.906Z","logger":"controller","message":"deleted nodeclaim","commit":"ee194ab","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"a100-vmdnp"},"namespace":"","name":"a100-vmdnp","reconcileID":"30609851-7cf6-49a1-9b0c-1fc9e61c6504","Node":{"name":""},"provider-id":"azure:///subscriptions/65dab16c-f619-4556-99b3-86a84c522eb4/resourceGroups/mc_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral/providers/Microsoft.Compute/virtualMachines/aks-a100-vmdnp"}
{"level":"info","ts":1731915317.596937,"logger":"fallback","caller":"instance/instance.go:152","msg":"launched new instance","launched-instance":"/subscriptions/65dab16c-f619-4556-99b3-86a84c522eb4/resourceGroups/MC_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral/providers/Microsoft.Compute/virtualMachines/aks-a100-dktb8","hostname":"aks-a100-dktb8","type":"Standard_NC48ads_A100_v4","zone":"1","capacity-type":"on-demand"}
{"level":"INFO","time":"2024-11-18T07:35:17.597Z","logger":"controller","message":"launched nodeclaim","commit":"ee194ab","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"a100-dktb8"},"namespace":"","name":"a100-dktb8","reconcileID":"cf485074-a773-48b8-8b33-d440e6c4bbcd","provider-id":"azure:///subscriptions/65dab16c-f619-4556-99b3-86a84c522eb4/resourceGroups/mc_aladin-kubernetes-v1-prd_aladin-aks-v1-prd_koreacentral/providers/Microsoft.Compute/virtualMachines/aks-a100-dktb8","instance-type":"Standard_NC48ads_A100_v4","zone":"koreacentral-1","capacity-type":"on-demand","allocatable":{"cpu":"47420m","ephemeral-storage":"128G","memory":"400085Mi","nvidia.com/gpu":"2","pods":"110"}}
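To make the retry cycle in the log above easier to see, here is a minimal sketch that pairs each "launched nodeclaim" entry with the matching "deleted nodeclaim" entry and reports the time between them. It only reads the controller lines (the ones with "time" and "message" fields) and skips the "fallback" logger lines. The function name is hypothetical and not part of Karpenter.

```python
import json
from datetime import datetime

# Matches the "time" field of the controller lines, e.g. 2024-11-18T07:00:17.834Z
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def launch_to_delete_seconds(lines):
    """Return {nodeclaim name: seconds from 'launched nodeclaim' to 'deleted nodeclaim'}."""
    launched = {}
    lifetimes = {}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        if message not in ("launched nodeclaim", "deleted nodeclaim"):
            continue  # this also skips "fallback" lines, which use "msg" and "ts"
        name = entry["NodeClaim"]["name"]
        when = datetime.strptime(entry["time"], TIME_FORMAT)
        if message == "launched nodeclaim":
            launched[name] = when
        elif name in launched:
            lifetimes[name] = (when - launched[name]).total_seconds()
    return lifetimes
```

For the two NodeClaims that are deleted in the log above (a100-7zl52 and a100-vmdnp), the timestamps work out to roughly 15 minutes 47 seconds between launch and deletion in both cases. That would be consistent with a registration timeout removing nodes that never register, but the log does not state the deletion reason, so treat that as an assumption.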


mdfaizsiddiqui commented 1 day ago

We are experiencing the same issue: the NodeClaim creates the node, but the node is unable to join the cluster. The NodeClaim looks like this:

apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    karpenter.azure.com/aksnodeclass-hash: "17390774718828605676"
    karpenter.azure.com/aksnodeclass-hash-version: v1
    karpenter.azure.com/in-place-update-hash: "3739780270"
    karpenter.sh/nodepool-hash: "10684441112471014376"
    karpenter.sh/nodepool-hash-version: v3
    karpenter.sh/stored-version-migrated: "true"
  creationTimestamp: "2024-11-20T12:55:10Z"
  finalizers:
  - karpenter.sh/termination
  generateName: default-
  generation: 1
  labels:
    karpenter.azure.com/sku-cpu: "2"
    karpenter.azure.com/sku-encryptionathost-capable: "true"
    karpenter.azure.com/sku-family: D
    karpenter.azure.com/sku-gpu-count: "0"
    karpenter.azure.com/sku-hyperv-generation: "2"
    karpenter.azure.com/sku-memory: "8192"
    karpenter.azure.com/sku-name: Standard_D2ps_v5
    karpenter.azure.com/sku-networking-accelerated: "true"
    karpenter.azure.com/sku-storage-premium-capable: "true"
    karpenter.azure.com/sku-version: "5"
    karpenter.azure.com/zone: eastus-1
    karpenter.sh/capacity-type: spot
    karpenter.sh/nodepool: default
    kubernetes.io/arch: arm64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: Standard_D2ps_v5
    topology.kubernetes.io/region: eastus
  name: default-j77ws
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: default
    uid: bb2aa657-070b-4802-b58c-937650391842
  resourceVersion: "15245540"
  uid: b2f1920c-1450-42a4-bfab-a9383b50664f
spec:
  expireAfter: 720h
  nodeClassRef:
    group: karpenter.azure.com
    kind: AKSNodeClass
    name: default
  requirements:
  - key: karpenter.azure.com/sku-name
    operator: In
    values:
    - Standard_D2ps_v5
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
    - spot
  - key: kubernetes.io/arch
    operator: In
    values:
    - arm64
  - key: karpenter.azure.com/sku-family
    operator: In
    values:
    - D
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - Standard_D2ps_v5
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - eastus-1
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - default
  resources:
    requests:
      cpu: 1054m
      memory: 1878Mi
      pods: "14"
status:
  allocatable:
    cpu: 1900m
    ephemeral-storage: 128G
    memory: "5226731929"
    pods: "110"
  capacity:
    cpu: "2"
    ephemeral-storage: 128G
    memory: "7945689497"
    pods: "110"
  conditions:
  - lastTransitionTime: "2024-11-20T12:56:33Z"
    message: Node not registered with cluster
    reason: NodeNotFound
    status: Unknown
    type: Initialized
  - lastTransitionTime: "2024-11-20T12:56:33Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-11-20T12:56:33Z"
    message: Initialized=Unknown, Registered=Unknown
    reason: UnhealthyDependents
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-11-20T12:56:33Z"
    message: Node not registered with cluster
    reason: NodeNotFound
    status: Unknown
    type: Registered
  imageID: /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2arm64containerd/versions/202411.12.0
  providerID: azure:///subscriptions/xxxxxxxxxxxxx/resourceGroups/staging-eastus-aks-k1-nodes/providers/Microsoft.Compute/virtualMachines/aks-default-j77ws
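For anyone comparing their own NodeClaims against the one above, a small sketch that lists the status conditions that are not "True". It assumes the object has already been loaded into a dict, for example from the output of kubectl get nodeclaim -o json. The function name is hypothetical.

```python
def pending_conditions(nodeclaim):
    """Return (type, status, reason, message) for each condition whose status is not "True"."""
    conditions = nodeclaim.get("status", {}).get("conditions", [])
    return [
        (c.get("type"), c.get("status"), c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("status") != "True"
    ]

# Example (assumes nodeclaim.json holds the JSON form of the object):
#   import json
#   with open("nodeclaim.json") as f:
#       for row in pending_conditions(json.load(f)):
#           print(row)
```

Applied to the NodeClaim above, this would list Initialized, Ready, and Registered (all Unknown), while Launched is "True" and is left out. In other words, the VM was created but the node never registered.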

I logged on to one of the failing nodes and saw the following errors in the kubelet logs:

Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.481871    5465 csi_plugin.go:892] Failed to contact API server when waiting for CSINode publishing: Unauthorized
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.689912    5465 kubelet_node_status.go:356] "Setting node annotation to enable volume controller attach/detach"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.690179    5465 kubelet_node_status.go:373] "the node label will overwrite default setting" labelKey="kubernetes.io/arch" labelValue="arm64" default="arm64"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.690312    5465 kubelet_node_status.go:373] "the node label will overwrite default setting" labelKey="kubernetes.io/os" labelValue="linux" default="linux"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691144    5465 kubelet_node_status.go:679] "Recording event message for node" node="aks-default-6jmgj" event="NodeHasSufficientMemory"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691176    5465 kubelet_node_status.go:679] "Recording event message for node" node="aks-default-6jmgj" event="NodeHasNoDiskPressure"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691187    5465 kubelet_node_status.go:679] "Recording event message for node" node="aks-default-6jmgj" event="NodeHasSufficientPID"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: I1120 07:59:11.691209    5465 kubelet_node_status.go:73] "Attempting to register node" node="aks-default-6jmgj"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: E1120 07:59:11.711941    5465 controller.go:145] "Failed to ensure lease exists, will retry" err="Unauthorized" interval="7s"
Nov 20 07:59:11 aks-default-6jmgj kubelet[5465]: E1120 07:59:11.712268    5465 kubelet_node_status.go:96] "Unable to register node with API server" err="Unauthorized" node="aks-default-6jmgj"
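When the kubelet journal is long, it can help to tally which call sites report Unauthorized. The sketch below assumes lines in the same shape as above (a syslog prefix followed by the klog header: severity letter, mmdd, time, PID, file:line]). The function name is hypothetical.

```python
import re
from collections import Counter

# Syslog prefix "kubelet[pid]: " followed by the klog header "Lmmdd hh:mm:ss.uuuuuu pid file:line] msg"
KLOG_LINE = re.compile(
    r"kubelet\[\d+\]: ([IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ ([^\s\]]+:\d+)\] (.*)"
)


def unauthorized_call_sites(lines):
    """Count (severity, file:line) pairs for kubelet log lines that mention Unauthorized."""
    counts = Counter()
    for line in lines:
        match = KLOG_LINE.search(line)
        if match is None:
            continue  # not a kubelet klog line
        severity, call_site, message = match.groups()
        if "Unauthorized" in message:
            counts[(severity, call_site)] += 1
    return counts
```

On the excerpt above this picks out three call sites: csi_plugin.go:892 (logged at info level), controller.go:145, and kubelet_node_status.go:96. All three are API server calls rejected as Unauthorized, which suggests a problem with the credentials the kubelet presents rather than with the network path, although the excerpt alone does not confirm the cause.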