Azure / karpenter-provider-azure

AKS Karpenter Provider
Apache License 2.0

Restarting an AKS cluster causes Karpenter to malfunction #371

Closed HakjunMIN closed 1 month ago

HakjunMIN commented 1 month ago

Version

0.4.0

Expected Behavior

After stopping and starting the AKS cluster in the portal, Karpenter should create VMs that join the cluster as nodes.

Actual Behavior

After stopping and starting the AKS cluster in the portal, the VMs that Karpenter creates cannot join the cluster.

I can see that karpentermsi creates the VM properly and the logs show a normal nodeclaim, but the created VM never becomes a member of the cluster. Karpenter waits for a while, then creates another nodeclaim, which creates another VM, and that VM also fails to join the cluster. This create-and-fail cycle repeats.

I tried uninstalling and reinstalling Karpenter, but that did not help.

Steps to Reproduce the Problem

Stop and start the cluster, then make Karpenter claim a node.

Resource Specs and Logs


{"level":"DEBUG","time":"2024-05-26T23:59:30.276Z","logger":"controller.nodeclaim.lifecycle","message":"Created  virtual machine AKS identifying extension for aks-gpu-tc7ln, with an id of /subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-tc7ln/extensions/computeAksLinuxBilling","commit":"bbaa9b7","nodeclaim":"gpu-tc7ln"}
{"level":"INFO","time":"2024-05-26T23:59:30.276Z","logger":"controller.nodeclaim.lifecycle","message":"launched new instance","commit":"bbaa9b7","nodeclaim":"gpu-tc7ln","launched-instance":"/subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-tc7ln","hostname":"aks-gpu-tc7ln","type":"Standard_NV6ads_A10_v5","zone":"3","capacity-type":"on-demand"}
{"level":"INFO","time":"2024-05-26T23:59:30.276Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"bbaa9b7","nodeclaim":"gpu-tc7ln","provider-id":"azure:///subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/mc_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-tc7ln","instance-type":"Standard_NV6ads_A10_v5","zone":"","capacity-type":"on-demand","allocatable":{"cpu":"5840m","ephemeral-storage":"128G","memory":"46288Mi","nvidia.com/gpu":"1","pods":"110"}}
{"level":"INFO","time":"2024-05-26T23:59:36.899Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"bbaa9b7","pods":"kubeflow-user-example-com/ja-0","duration":"35.37174ms"}
{"level":"INFO","time":"2024-05-26T23:59:46.903Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"bbaa9b7","pods":"kubeflow-user-example-com/ja-0","duration":"37.83814ms"}
{"level":"INFO","time":"2024-05-26T23:59:56.902Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"bbaa9b7","pods":"kubeflow-user-example-com/ja-0","duration":"36.251346ms"}
{"level":"INFO","time":"2024-05-27T00:00:06.901Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"bbaa9b7","pods":"kubeflow-user-example-com/ja-0","duration":"34.524844ms"}

...
...
...

{"level":"INFO","time":"2024-05-27T00:15:58.957Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"bbaa9b7","nodeclaim":"gpu-rbdm8","provider-id":"azure:///subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/mc_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-rbdm8","instance-type":"Standard_NV6ads_A10_v5","zone":"","capacity-type":"on-demand","allocatable":{"cpu":"5840m","ephemeral-storage":"128G","memory":"46288Mi","nvidia.com/gpu":"1","pods":"110"}}
{"level":"INFO","time":"2024-05-27T00:16:06.978Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"bbaa9b7","pods":"kubeflow-user-example-com/ja-0","duration":"32.973449ms"}
{"level":"INFO","time":"2024-05-27T00:16:17.009Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"bbaa9b7","pods":"kubeflow-user-example-com/ja-0","duration":"63.578604ms"}
...
...
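One way to confirm this failure mode from controller logs like the ones above is to list the nodeclaims that were launched but never registered. The following is a minimal sketch, not part of Karpenter: the function name `unregistered_nodeclaims` is hypothetical, and it assumes that a successful node join is logged with the message "registered nodeclaim", which does not appear in the excerpt above.

```python
import json


def unregistered_nodeclaims(lines):
    """Return {nodeclaim: launch time} for nodeclaims that were launched but never registered."""
    launched = {}       # nodeclaim name -> timestamp of its first "launched nodeclaim" entry
    registered = set()  # nodeclaim names that later reported registration
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and elision markers such as "..."
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or otherwise malformed entries
        name = entry.get("nodeclaim")
        if name is None:
            continue  # e.g. provisioner entries, which carry no nodeclaim field
        message = entry.get("message")
        if message == "launched nodeclaim":
            launched.setdefault(name, entry.get("time"))
        elif message == "registered nodeclaim":  # assumption: message text for a successful join
            registered.add(name)
    return {name: time for name, time in launched.items() if name not in registered}
```

Passing the lines of a saved controller log to this function returns the nodeclaims whose VMs were created but apparently never joined the cluster. Given only the entries shown in this report, it would be expected to list gpu-tc7ln and gpu-rbdm8.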


HakjunMIN commented 1 month ago

Unfortunately, running az aks update -g rg -n clusterName didn't help. I tried it because this looked similar to #248.

Bryce-Soghigian commented 1 month ago

We do not support start/stop. For NAP we explicitly block it in validation. For self-hosted, nothing blocks you, but it's listed as unsupported.

I am not sure when we will prioritize this work. Thanks for sharing the failure mode though!

HakjunMIN commented 1 month ago

Got it. I hope start/stop support is on the backlog. Thank you @Bryce-Soghigian

Bryce-Soghigian commented 1 month ago

thanks @HakjunMIN!