Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] AKS Automatic takes up to an hour to install extensions and installers time out with errors #4513

Open · brooke-hamilton opened this issue 2 weeks ago

brooke-hamilton commented 2 weeks ago

Describe the bug: Installing an AKS extension on an AKS Automatic cluster, or installing a Kubernetes add-on whose pods do not tolerate the CriticalAddonsOnly taint, takes about an hour, and the installers return timeout errors. Examples are Dapr and Radius.

To Reproduce: The example below installs Dapr using the az k8s-extension create command. The Radius command rad install kubernetes results in similar behavior.

# Create an AKS Automatic cluster with a generated name.
resource_group="<resource group name>"
cluster_name=$(mktemp -u "$resource_group-XXXX")
echo "Creating cluster $cluster_name"
az aks create --resource-group "$resource_group" --name "$cluster_name" --sku automatic --generate-ssh-keys -l southcentralus
az aks get-credentials --resource-group "$resource_group" --name "$cluster_name"

# Install the Dapr cluster extension; this is the call that eventually times out.
az k8s-extension create --cluster-type managedClusters \
    --cluster-name "$cluster_name" \
    --resource-group "$resource_group" \
    --name dapr \
    --extension-type Microsoft.Dapr \
    --auto-upgrade-minor-version false
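
For reference, the CriticalAddonsOnly taint on the initial system nodes can be confirmed with a jsonpath query like the one below (an illustrative kubectl invocation, not part of the original repro):

# Print each node name followed by the keys of its taints.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'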

Expected behavior: The extension installs successfully, without timing out.

Screenshots

Results of the az k8s-extension create command:

(ExtensionOperationFailed) The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state Last resource not ready was dapr-system/dapr-monitoring-metrics For more events check kubernetes events using kubectl events -n dapr-system : Recommendation Please contact Microsoft support for further inquiries : InnerError [release dapr failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: For additional troubleshooting information, please see https://aka.ms/dapr-aks.
Code: ExtensionOperationFailed
Message: The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state Last resource not ready was dapr-system/dapr-monitoring-metrics For more events check kubernetes events using kubectl events -n dapr-system : Recommendation Please contact Microsoft support for further inquiries : InnerError [release dapr failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: For additional troubleshooting information, please see https://aka.ms/dapr-aks.
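
While the operation is in flight, its provisioning state can be polled with az k8s-extension show (an illustrative diagnostic step, not part of the original report; it assumes the same shell variables as the repro script):

# Check the extension's provisioningState while the create call is still running.
az k8s-extension show --cluster-type managedClusters \
    --cluster-name "$cluster_name" \
    --resource-group "$resource_group" \
    --name dapr --query provisioningState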

While the installer is running, this command returns the results below. You can see that the dapr-system namespace was created about 30 minutes into the cluster's life (namespace age 7m9s versus cluster age 38m) and that none of the Dapr pods have started. At this point the Dapr installer was still running.

$ kubectl get nodes && kubectl get namespaces && kubectl get pods -n dapr-system
NAME                                STATUS   ROLES    AGE   VERSION
aks-nodepool1-91565423-vmss000000   Ready    <none>   37m   v1.29.7
aks-nodepool1-91565423-vmss000001   Ready    <none>   37m   v1.29.7
aks-nodepool1-91565423-vmss000002   Ready    <none>   37m   v1.29.7
NAME                 STATUS   AGE
app-routing-system   Active   21m
dapr-system          Active   7m9s
default              Active   38m
gatekeeper-system    Active   37m
kube-node-lease      Active   38m
kube-public          Active   38m
kube-system          Active   38m
NAME                                       READY   STATUS    RESTARTS   AGE
dapr-monitoring-metrics-77bbd48f79-j2dz6   0/4     Pending   0          6m25s
dapr-operator-7c77dccfcb-7t5vr             0/1     Pending   0          6m25s
dapr-operator-7c77dccfcb-vr754             0/1     Pending   0          6m25s
dapr-operator-7c77dccfcb-zvg26             0/1     Pending   0          6m25s
dapr-placement-server-0                    0/1     Pending   0          6m25s
dapr-placement-server-1                    0/1     Pending   0          6m25s
dapr-placement-server-2                    0/1     Pending   0          6m25s
dapr-sentry-74d8d794fd-49jkv               0/1     Pending   0          6m25s
dapr-sentry-74d8d794fd-wk69h               0/1     Pending   0          6m25s
dapr-sentry-74d8d794fd-zdfvf               0/1     Pending   0          6m25s
dapr-sidecar-injector-74cb4f87fd-9fk42     0/1     Pending   0          6m25s
dapr-sidecar-injector-74cb4f87fd-c6fns     0/1     Pending   0          6m25s
dapr-sidecar-injector-74cb4f87fd-jddhb     0/1     Pending   0          6m25s

Examining the pods that fail to schedule, the event below is shown in the Azure Portal:

involvedObject:
  kind: Pod
  namespace: dapr-system
  name: dapr-operator-7c77dccfcb-7t5vr
  uid: 11632e33-5066-4abc-97ec-c6c74945a78c
  apiVersion: v1
  resourceVersion: '12726'
reason: FailedScheduling
message: >-
  0/3 nodes are available: 3 node(s) had untolerated taint {CriticalAddonsOnly:
  true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for
  scheduling.
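
For illustration only, a toleration like the one in the patch below would let the pending deployments schedule onto the tainted system nodes. This is a hypothetical diagnostic sketch, not a recommended fix: the extension's atomic Helm install can revert or uninstall manually patched resources.

# Hypothetical: add a CriticalAddonsOnly toleration to one of the pending deployments.
kubectl patch deployment dapr-operator -n dapr-system -p \
    '{"spec":{"template":{"spec":{"tolerations":[{"key":"CriticalAddonsOnly","operator":"Exists","effect":"NoSchedule"}]}}}}'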

Eventually, AKS Automatic creates an additional node without the CriticalAddonsOnly taint, and the pods are able to start.

$ kubectl get nodes && kubectl get namespaces && kubectl get pods -n dapr-system
NAME                                STATUS   ROLES    AGE    VERSION
aks-default-nkpzb                   Ready    <none>   116s   v1.29.7
aks-nodepool1-91565423-vmss000000   Ready    <none>   50m    v1.29.7
aks-nodepool1-91565423-vmss000001   Ready    <none>   50m    v1.29.7
aks-nodepool1-91565423-vmss000002   Ready    <none>   50m    v1.29.7
NAME                 STATUS   AGE
app-routing-system   Active   35m
dapr-system          Active   20m
default              Active   52m
gatekeeper-system    Active   51m
kube-node-lease      Active   52m
kube-public          Active   52m
kube-system          Active   52m
NAME                                       READY   STATUS    RESTARTS   AGE
dapr-monitoring-h5tvr                      3/3     Running   0          98s
dapr-monitoring-metrics-77bbd48f79-66srf   4/4     Running   0          2m51s
dapr-operator-7c77dccfcb-kj4tm             1/1     Running   0          2m51s
dapr-operator-7c77dccfcb-nppcj             1/1     Running   0          2m51s
dapr-operator-7c77dccfcb-wztws             1/1     Running   0          2m51s
dapr-placement-server-0                    1/1     Running   0          2m51s
dapr-placement-server-1                    1/1     Running   0          2m51s
dapr-placement-server-2                    1/1     Running   0          2m51s
dapr-sentry-74d8d794fd-k6mdx               1/1     Running   0          2m51s
dapr-sentry-74d8d794fd-ldbl9               1/1     Running   0          2m51s
dapr-sentry-74d8d794fd-rmb6j               1/1     Running   0          2m51s
dapr-sidecar-injector-74cb4f87fd-cqr26     1/1     Running   0          2m51s
dapr-sidecar-injector-74cb4f87fd-ldfjm     1/1     Running   0          2m51s
dapr-sidecar-injector-74cb4f87fd-p7wvn     1/1     Running   0          2m51s
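
As a quick confirmation (not captured in the original output), a wide pod listing should show the Dapr pods placed on the new aks-default node rather than on the tainted system nodes:

kubectl get pods -n dapr-system -o wide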

Additional context: It looks like add-ons such as Dapr and Radius do not tolerate the CriticalAddonsOnly taint. Eventually AKS Automatic adds a new node that does not have the taint, and the Dapr pods are able to start.

Nodes added to an AKS Automatic cluster have the taint applied, and attempting to remove the taint from the system node pool fails with an error:

$ az aks nodepool update --resource-group brhamilt-aks --cluster-name brhamilt-aks-Caoh --name nodepool1 --node-taints ""
The behavior of this command has been altered by the following extension: aks-preview
(BadRequest) Managed cluster 'Automatic' SKU should set taint 'CriticalAddonsOnly=true:NoSchedule' for 'System' node pool
Code: BadRequest
Message: Managed cluster 'Automatic' SKU should set taint 'CriticalAddonsOnly=true:NoSchedule' for 'System' node pool
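
One possible workaround (untested; the nap-warmup name and pause image are illustrative) is to schedule a small placeholder workload before installing the extension, so that node auto-provisioning creates an untainted node in advance:

# Hypothetical warm-up: a tiny deployment that cannot tolerate the system-pool taint,
# forcing node auto-provisioning to bring up a general-purpose node.
kubectl create deployment nap-warmup --image=registry.k8s.io/pause:3.9 --replicas=1
kubectl set resources deployment nap-warmup --requests=cpu=100m,memory=128Mi
# After the new node appears, install the extension, then clean up:
kubectl delete deployment nap-warmup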


sabbour commented 1 week ago

@brooke-hamilton are you able to provide the Node Auto Provisioning logs from the cluster?

kubectl get events -A --field-selector source=karpenter -w