Open JoooostB opened 10 months ago
Tested this and hit the same issues. With Standard_D2s_v3 there are no problems at all. Also tried it without VNet integration, but with the same result.
# DOES NOT WORK!
az aks create -n aks-cilium-test-brammetje-2 \
-g aks-rg \
-l westeurope \
--os-sku Ubuntu \
--node-vm-size Standard_D2ps_v5 \
--max-pods 250 \
--node-count 3 \
--network-plugin 'azure' \
--network-dataplane 'cilium' \
--network-plugin-mode overlay \
--kubernetes-version "1.27.3" \
--vnet-subnet-id "${SUBNET_2}" \
--generate-ssh-keys
# WORKS!
az aks create -n aks-cilium-test-brammetje-3 \
-g aks-rg \
-l westeurope \
--os-sku Ubuntu \
--node-vm-size Standard_D2s_v3 \
--max-pods 250 \
--node-count 3 \
--network-plugin 'azure' \
--network-dataplane 'cilium' \
--network-plugin-mode overlay \
--kubernetes-version "1.27.3" \
--vnet-subnet-id "${SUBNET_3}" \
--generate-ssh-keys
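As a side note, one quick way to confirm which CPU architecture each node pool actually ended up on (a verification step of my own, not from the original report) is to query the `kubernetes.io/arch` node label:

```shell
# Show each node together with its CPU architecture label.
# Standard_D2ps_v5 nodes report arm64; Standard_D2s_v3 nodes report amd64.
kubectl get nodes -L kubernetes.io/arch

# Alternatively, print just the architecture reported by each kubelet:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'
```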
Also been running into this issue for months now. Any update on the timeline for when this will be fixed?
Apologies for the delay. This may have been related to the initial introduction of the feature; it should be working now. Are there any issues you still see? (Do open a support ticket if something that shouldn't be affecting you still is.)
@bramvdklinkenberg Any chance you could test again? I don't have access to an active Azure tenant anymore.
Describe the bug
Communication between pods in the Kubernetes cluster times out when using ARM-based nodes in combination with Azure CNI Powered by Cilium. Core components tend to break, as kube-dns/core-dns is unreachable from pods on other nodes. Notably, communication to the internet and between nodes seems to function correctly, although you have to fall back to a different DNS server for name resolution.
To Reproduce
Steps to reproduce the behavior:
1. Create a cluster with at least two ARM-based nodes (I used Standard_D2ps_v5), with cilium as network-dataplane and azure as network-plugin.
2. Start a debugging deployment/pod with tools like cURL, wget, dig, nslookup etc. on the second node, as the first node tends to work due to the presence of kube-dns on that specific node.
3. Try to resolve services, reach another pod from your debug pod, or just try an apt update, and notice your DNS lookups timing out.
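The debug-pod step above can be sketched roughly as follows; the pod name, image, and node selector are illustrative assumptions on my part, not taken from the original report:

```shell
# Run a throwaway debug pod pinned to an arm64 node. Adjust the selector so it
# lands on a node *without* kube-dns/core-dns to observe the failure.
kubectl run dns-debug --rm -it --restart=Never \
  --image=nicolaka/netshoot \
  --overrides='{"spec":{"nodeSelector":{"kubernetes.io/arch":"arm64"}}}' \
  -- nslookup kubernetes.default.svc.cluster.local

# On an affected cluster this lookup times out; on x86 nodes it resolves.
```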
Another way to reproduce the issue is by running the Cilium connectivity test, which fails almost instantly on DNS resolution as well:
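For reference, the connectivity test mentioned here can be run with the Cilium CLI (assuming it is installed and the kubeconfig points at the affected cluster):

```shell
# Runs Cilium's built-in end-to-end connectivity checks. On an affected
# cluster, the DNS-dependent and cross-node pod-to-pod tests fail quickly.
cilium connectivity test
```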
Expected behavior
Pods should be able to reach other services, as they do on x86-based nodes.
Additional context
I found out the culprit was the arm64 architecture by deploying a new cluster via the CLI for debugging purposes, and to my surprise, it worked seamlessly. This prompted a comparison between the new temporary but functional cluster and my existing one, which revealed a single difference: the vm_size, which was an x86 variant instead of the previously deployed arm64 one. Subsequently, switching the vm_size to an x86 model on our existing cluster resulted in a functioning cluster again, where communication between pods worked flawlessly. But in my opinion this is more of a workaround than a permanent solution: our intended workloads are optimised for ARM, and reverting to x86 is a massive downgrade.