Closed cloudziu closed 3 months ago
Hi!
I have the same problem in our environment when we try to create an AKS cluster via Terraform. Terraform tries to create the cluster for an hour and a half and then throws an error.
In the Activity log in the Azure Portal I noticed exactly the same error message as yours, @cloudziu. I guess this is a bug on the Azure side, because one week ago we were able to create AKS clusters without any problems using the same Terraform module as now.
@rahimek Resolved on my side. We had to whitelist the address acs-mirror.azureedge.net in our firewall. There was probably an address change in the deployment scripts or something.
If so, this is possibly related: https://github.com/MicrosoftDocs/azure-docs/pull/123359
Hmmm, but on a closer look, the domain you mention has been in the required FQDN list for quite a long time. (It was already there when I started getting familiar with AKS.) https://learn.microsoft.com/en-us/azure/aks/outbound-rules-control-egress#azure-global-required-fqdn--application-rules
Thanks for your replies. Actually, the documentation says this error is related to connection problems (https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/error-code-cnidownloadtimeoutvmextensionerror). But as you said, @JoeyC-Dev, there was no problem with the endpoint acs-mirror.azureedge.net earlier.
I also created a support ticket with Microsoft, so if I get an answer I will let you know.
Hey @JoeyC-Dev, honestly I am also surprised it worked before. I would assume that the CNI was pulled from mcr.microsoft.com. Anyway, thanks for the shared resources.
@cloudziu I have one question. Did you have to whitelist acs-mirror.azureedge.net for traffic from your AKS VNet address space, or from your pod_cidr?
Hey @rahimek, in my case from the VNET where the VMSS is created. Nodes need access to be able to download required binaries, in this particular case the CNI.
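For anyone who wants to verify this before re-running the deployment, here is a minimal sketch of a reachability check, assuming Python 3 is available on a machine that sits behind the same firewall or proxy as the node VNet. The function name `can_reach_tls` is made up for illustration, and the two hosts are simply the ones mentioned in this thread, not a complete list of required endpoints.

```python
import socket
import ssl


def can_reach_tls(host, port=443, timeout=3.0):
    """Return True if a verified TLS handshake with host:port succeeds."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # Covers DNS failures, refused or timed-out connections, and TLS
        # errors (ssl.SSLError is a subclass of OSError).
        return False


if __name__ == "__main__":
    # Hosts taken from this thread; adjust to your own egress rules.
    for name in ("acs-mirror.azureedge.net", "mcr.microsoft.com"):
        status = "reachable" if can_reach_tls(name) else "NOT reachable"
        print(f"{name}: {status}")
```

A host can also show up as not reachable when a TLS-inspecting proxy presents a certificate that the local trust store does not know, so a failure here does not by itself prove that the firewall is blocking the address.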
I can advise you to SSH into the VM that is created by AKS and browse the /var/log directory. There are plenty of logs there that helped me drill down to the core issue.
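If browsing by hand is too slow, something like the following could narrow the search. It is only a sketch: it assumes Python 3 is present on the node and that you have permission to read the files, and the keywords are guesses based on the error discussed in this thread, not known log messages.

```python
import os


def scan_logs(root, keywords, max_hits=50):
    """Return (path, line_number, line) for lines containing any keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip sockets, FIFOs, and broken symlinks
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(k in line.lower() for k in lowered):
                            hits.append((path, number, line.rstrip("\n")))
                            if len(hits) >= max_hits:
                                return hits
            except OSError:
                continue  # unreadable file, e.g. missing permissions
    return hits


if __name__ == "__main__":
    # Keywords are assumptions; replace them with whatever you are hunting for.
    for path, number, line in scan_logs("/var/log", ["acs-mirror", "timed out"]):
        print(f"{path}:{number}: {line}")
```

The match is case-insensitive and stops after `max_hits` lines so that a noisy log does not flood the terminal.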
Thank you very much!
OK, it is now resolved on our side as well. Apart from what @cloudziu wrote (whitelisting endpoints on the firewall, in our case on the proxy), we had to add our custom CA certificates to the system node pool. In Terraform this is the parameter called custom_ca_trust_certificates_base64.
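For reference, a small sketch of how the values for that parameter could be prepared, assuming (as the name suggests) that it expects the contents of each CA certificate file encoded as base64. The helper name and the file name `corp-root-ca.pem` are hypothetical; check the provider documentation for the exact format it expects.

```python
import base64


def encode_ca_certificate(pem_bytes):
    """Return the base64 text of a CA certificate's raw file contents."""
    return base64.b64encode(pem_bytes).decode("ascii")


if __name__ == "__main__":
    # Hypothetical file name; point this at your own CA certificate.
    try:
        with open("corp-root-ca.pem", "rb") as handle:
            print(encode_ca_certificate(handle.read()))
    except FileNotFoundError:
        print("corp-root-ca.pem not found; pass your own certificate path")
```

The printed string is what would go into the list passed to custom_ca_trust_certificates_base64, one entry per certificate.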
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
Describe the bug
When creating an AKS cluster, the default node pool cannot provision successfully. There is failure information in the VMSS Azure Activity Log. Because of that, AKS is stuck in the Creating state, and so is the VMSS. I have not provided any custom scripts to the extensions.
This is the status message from the Activity Log error:
Operation name: Create or Update Virtual Machine Scale Set
Event initiated by: AzureContainerService
Error code: ResourceOperationFailure
Message: The resource operation completed with terminal provisioning state 'Failed'.
Things that I've tested: