Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] AKS and VMSS stuck in Creating state #4353

Closed cloudziu closed 3 months ago

cloudziu commented 5 months ago

Describe the bug

When creating an AKS cluster, the default node pool fails to provision successfully. The VMSS Azure Activity Log shows a failure, and because of it both the AKS cluster and the VMSS are stuck in the Creating state.

I have not provided any custom scripts to the extensions.

This is the status message from the Activity Log error:

Operation name: Create or Update Virtual Machine Scale Set

Event initiated by: AzureContainerService

Error code: ResourceOperationFailure

Message: The resource operation completed with terminal provisioning state 'Failed'.

{"status":"Failed","error":{"code":"ResourceOperationFailure","message":"The resource operation completed with terminal provisioning state 'Failed'.","details":[{"code":"VMExtensionProvisioningError","target":"0","message":"VM has reported a failure when processing extension 'vmssCSE' (publisher 'Microsoft.Azure.Extensions' and type 'CustomScript'). Error message: 'Enable failed: failed to execute command: command terminated with exit status=41
[stdout]
Reading FIFO (Named Pipe): collect/ss_stats.txt
  adding: collect/ss_stats.txt (deflated 49%)
Adding log files to zip archive...
  adding: etc/default/kubelet (deflated 56%)
  adding: var/log/azure-cnimonitor.log (stored 0%)
  adding: var/log/azure-vnet-ipam.log (stored 0%)
  adding: var/log/azure-vnet-telemetry.log (stored 0%)
  adding: var/log/azure-vnet.log (stored 0%)
  adding: var/lib/waagent/provisioned (stored 0%)
  adding: etc/fstab (deflated 22%)
  adding: etc/ssh/sshd_config (deflated 54%)
  adding: boot/grub/grub.cfg (deflated 76%)
  adding: etc/lsb-release (deflated 24%)
  adding: etc/os-release (deflated 38%)
  adding: etc/hostname (deflated 6%)
  adding: etc/apt/sources.list (deflated 73%)
  adding: etc/apt/sources.list.d/microsoft-prod-testing.list (deflated 7%)
  adding: etc/apt/sources.list.d/microsoft-prod.list (deflated 5%)
  adding: etc/netplan/50-cloud-init.yaml (deflated 45%)
  adding: etc/nsswitch.conf (deflated 50%)
  adding: etc/resolv.conf (deflated 49%)
  adding: run/systemd/resolve/stub-resolv.conf (deflated 49%)
  adding: etc/ufw/ufw.conf (deflated 31%)
  adding: etc/waagent.conf (deflated 54%)
  adding: var/lib/hyperv/.kvp_pool_0 (stored 0%)
  adding: var/lib/hyperv/.kvp_pool_1 (deflated 92%)
  adding: var/lib/hyperv/.kvp_pool_2 (stored 0%)
  adding: var/lib/hyperv/.kvp_pool_3 (deflated 99%)
  adding: var/lib/hyperv/.kvp_pool_4 (stored 0%)
  adding: var/log/azure/custom-script/handler.log (deflated 84%)
  adding: var/lib/waagent/ovf-env.xml (deflated 35%)
  adding: var/lib/waagent/Microsoft.AKS.Compute.AKS.Linux.Billing-1.0.0/status/1.status (deflated 50%)
  adding: var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.1.10/status/1.status (deflated 31%)
  adding: var/lib/waagent/Microsoft.AKS.Compute.AKS.Linux.Billing-1.0.0/config/1.settings (deflated 37%)
  adding: var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.1.10/config/1.settings (deflated 24%)
  adding: var/lib/waagent/Microsoft.AKS.Compute.AKS.Linux.Billing-1.0.0/config/HandlerState (stored 0%)
  adding: var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.1.10/config/HandlerState (stored 0%)
  adding: var/lib/waagent/Microsoft.AKS.Compute.AKS.Linux.Billing-1.0.0/config/HandlerStatus (deflated 24%)
  adding: var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.1.10/config/HandlerStatus (deflated 24%)
  adding: var/lib/waagent/SharedConfig.xml (deflated 51%)
  adding: var/log/cloud-init-output.log (deflated 75%)
  adding: var/log/cloud-init.log (deflated 87%)
  adding: var/log/azure/Microsoft.AKS.Compute.AKS.Linux.Billing/CommandExecution.log (deflated 75%)
  adding: var/log/azure/Microsoft.Azure.Extensions.CustomScript/CommandExecution.log (deflated 82%)
  adding: var/log/azure/aks/cloud-config.txt (deflated 33%)
  adding: var/log/azure/aks/cluster-provision-cse-output.log (deflated 66%)
  adding: var/log/azure/aks/cluster-provision.log (deflated 94%)
  adding: var/log/azure/aks/components.json (deflated 86%)
  adding: var/log/azure/aks/kube-proxy-images.json (deflated 66%)
  adding: var/log/azure/aks/manifest.json (deflated 75%)
  adding: var/log/azure/aks/provision.json (deflated 77%)
  adding: var/log/azure/aks/vhd-install.complete (deflated 82%)
  adding: var/log/azure/Microsoft.Azure.Extensions.CustomScript/events/1718264198403.json (deflated 57%)
  adding: var/log/syslog (deflated 88%)
  adding: var/log/syslog.1 (deflated 84%)
  adding: var/log/messages (deflated 88%)
  adding: var/log/messages.1 (deflated 85%)
  adding: var/log/kern.log (deflated 91%)
  adding: var/log/kern.log.1 (deflated 79%)
  adding: var/log/dmesg (deflated 70%)
  adding: var/log/dmesg.0 (deflated 70%)
  adding: var/log/dmesg.1.gz (stored 0%)
  adding: var/log/dmesg.2.gz (stored 0%)
  adding: var/log/dmesg.3.gz (stored 0%)
  adding: var/log/dpkg.log (stored 0%)
  adding: var/log/dpkg.log.1 (deflated 92%)
  adding: var/log/auth.log (deflated 63%)
  adding: var/log/auth.log.1 (deflated 89%)
Log bundle size: 1016K  aks_logs.zip
Uploading log bundle: Successfully uploaded logs
Cleaning up /tmp/tmp.gaY0U8EhNc...
Log collection finished.

[stderr]
date: invalid date ‘n/a’
'. More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot. "}]}}

Things that I've tested:

rahimek commented 5 months ago

Hi!

I have the same problem in our environment when we try to create an AKS cluster via Terraform. Terraform tries to create the cluster for an hour and a half and then throws an error.

In the Activity Log in the Azure Portal I see exactly the same error message as yours, @cloudziu. I guess this is a bug on the Azure side, because one week ago we were able to create AKS clusters without any problems using the same Terraform module.

cloudziu commented 5 months ago

@rahimek Resolved on my side. We had to whitelist the address acs-mirror.azureedge.net in our firewall. There was probably an address change in the deployment scripts or something.
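
For anyone checking the same thing, here is a minimal sketch of a reachability test that could be run from a machine in the node subnet. The hostname comes from the comment above; port 443 and the 5-second timeout are assumptions. It only verifies that a TCP connection opens, so a proxy that intercepts TLS could still break the real download.

```python
import socket

# Hostname taken from the comment above; port 443 (HTTPS) is an assumption.
MIRROR_HOST = "acs-mirror.azureedge.net"
MIRROR_PORT = 443


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    ok = can_reach(MIRROR_HOST, MIRROR_PORT)
    print(f"{MIRROR_HOST}:{MIRROR_PORT} reachable: {ok}")
```

A False result from the node network while the same check succeeds elsewhere would point at the firewall or proxy rather than at AKS itself.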

JoeyC-Dev commented 5 months ago

If so, this is possibly related: https://github.com/MicrosoftDocs/azure-docs/pull/123359

Hmm, but on a closer look, the domain you mention has been in the required FQDN list for quite a long time (it was already there when I started getting familiar with AKS): https://learn.microsoft.com/en-us/azure/aks/outbound-rules-control-egress#azure-global-required-fqdn--application-rules

rahimek commented 5 months ago

Thanks for your replies. The documentation does say that this error is related to connection problems (https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/error-code-cnidownloadtimeoutvmextensionerror). But as you said, @JoeyC-Dev, there was no problem with the acs-mirror.azureedge.net endpoint before.

I have also opened a support ticket with Microsoft, so if I get an answer I will let you know.
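
As a rough illustration of how the Activity Log message ties back to that page, the sketch below extracts the exit status from the extension error text. The mapping of exit status 41 to a CNI download timeout is an assumption based on the linked troubleshooting page; `KNOWN_EXIT_CODES` and `extract_exit_status` are hypothetical names, not part of any Azure tooling.

```python
import re
from typing import Optional

# Assumption: exit status 41 is the CNI download timeout covered by the
# troubleshooting page linked above. Other codes are deliberately not listed.
KNOWN_EXIT_CODES = {41: "CNI download timeout"}


def extract_exit_status(message: str) -> Optional[int]:
    """Pull the number out of '... exit status=NN' in an extension error message."""
    match = re.search(r"exit status=(\d+)", message)
    return int(match.group(1)) if match else None


# Fragment of the error message from the original report.
sample = (
    "Enable failed: failed to execute command: "
    "command terminated with exit status=41"
)
code = extract_exit_status(sample)
print(code, KNOWN_EXIT_CODES.get(code, "unknown"))
```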

cloudziu commented 5 months ago

Hey @JoeyC-Dev, honestly I am also surprised it worked before. I would have assumed that the CNI was pulled from mcr.microsoft.com. Anyway, thanks for sharing the resources.

rahimek commented 4 months ago

@cloudziu I have one question: did you have to whitelist acs-mirror.azureedge.net for your AKS VNet address space, or for your pod_cidr?

cloudziu commented 4 months ago

Hey @rahimek, in my case from the VNet where the VMSS is created. The nodes need access to download the required binaries, in this particular case the CNI. I'd advise you to SSH into the VM created by AKS and browse the /var/log directory. There are plenty of logs there that helped me drill down to the core issue.
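
A minimal sketch of that log check, assuming it is run on the node itself with enough privileges to read the files. The two paths are the ones listed in the log bundle in the original report; the keyword list is a hypothetical starting point, not something the AKS scripts define.

```python
from pathlib import Path
from typing import Iterable, List

# Paths as they appear in the log bundle listing in the original report.
LOG_FILES = [
    "/var/log/azure/aks/cluster-provision.log",
    "/var/log/azure/aks/cluster-provision-cse-output.log",
]

# Hypothetical keywords; adjust to whatever the logs actually contain.
KEYWORDS = ("error", "timed out", "failed")


def suspicious_lines(lines: Iterable[str], keywords=KEYWORDS) -> List[str]:
    """Return lines that contain any keyword, compared case-insensitively."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(word in line.lower() for word in keywords)
    ]


if __name__ == "__main__":
    for name in LOG_FILES:
        try:
            with Path(name).open(errors="replace") as handle:
                for hit in suspicious_lines(handle):
                    print(f"{name}: {hit}")
        except OSError as exc:
            # Missing file or insufficient permissions.
            print(f"{name}: cannot read ({exc})")
```

The filter is deliberately loose; it is meant to narrow down where to read, not to diagnose the failure on its own.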

rahimek commented 4 months ago

Thank you very much!

rahimek commented 4 months ago

OK, it is resolved on our side now as well. Apart from what @cloudziu wrote (whitelisting the endpoints on the firewall, in our case on the proxy), we had to add our custom CA certificates to the system node pool. In Terraform this is the parameter called custom_ca_trust_certificates_base64.
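
As a sketch of how a value for that parameter could be prepared, the snippet below base64-encodes a PEM certificate. It assumes the parameter expects the base64 encoding of the whole PEM file, which should be verified against the provider documentation; `encode_ca_certificate` is a hypothetical helper and the certificate body is a placeholder, not a real CA.

```python
import base64


def encode_ca_certificate(pem_bytes: bytes) -> str:
    """Base64-encode a PEM certificate and return it as ASCII text."""
    return base64.b64encode(pem_bytes).decode("ascii")


# Placeholder certificate body, not a real CA.
example_pem = b"-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
encoded = encode_ca_certificate(example_pem)
print(encoded)
```

In practice the bytes would come from the CA file used by the proxy, for example by reading it with `pathlib.Path(...).read_bytes()`.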

microsoft-github-policy-service[bot] commented 3 months ago

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure