Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Pod to pod connectivity fails on arm64 based cluster using Azure CNI Powered by Cilium #3993

Open JoooostB opened 10 months ago

JoooostB commented 10 months ago

Describe the bug
Communication between pods in the Kubernetes cluster times out when using ARM-based nodes in combination with Azure CNI Powered by Cilium. Core components tend to break, as kube-dns/core-dns are unreachable from pods on other nodes. Notably, communication to the internet or between nodes seems to function correctly, although you have to resort to a different DNS server for name resolution.
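As a quick way to confirm the failure mode (a sketch; kube-dns and the k8s-app=kube-dns label are standard Kubernetes/AKS conventions, 10.0.0.10 is only the AKS default dns-service-ip, and busybox is just an example image, so substitute your own values):

# Where is CoreDNS running, and what is the DNS service IP?
kubectl -n kube-system get svc kube-dns -o wide
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# From a pod on a node without a CoreDNS replica, this lookup times out
# (scheduling is not pinned here; see the node-pinned sketch under the
# reproduction steps below):
kubectl run dns-check --rm -it --restart=Never --image=busybox -- \
  nslookup kubernetes.default.svc.cluster.local 10.0.0.10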

To Reproduce
Steps to reproduce the behavior:

  1. Create a cluster with at least two arm64-based nodes (I used Standard_D2ps_v5), with cilium as the network dataplane and azure as the network plugin:

    az aks create -n arm-debugging -g $resourceGroup -l westeurope \
    --max-pods 250 \
    --node-count 2 \
    --network-plugin azure \
    --os-sku AzureLinux \
    --node-vm-size Standard_D2ps_v5 \
    --kubernetes-version "1.27.3" \
    --network-plugin-mode overlay \
    --vnet-subnet-id $subnetId \
    --network-dataplane cilium

    I tried both Ubuntu and AzureLinux as the os-sku, with no difference in behaviour.

  2. Start a debugging deployment/pod with tools like cURL, wget, dig, nslookup, etc. on the second node, as the first node tends to work due to the presence of kube-dns on that specific node (see the sketch after this list for one way to pin the debug pod to a specific node).

  3. Try to resolve services, reach another pod from your debug pod, or just run an apt update, and notice your DNS lookups timing out.
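For step 2, one way to force the debug pod onto the second node is to set spec.nodeName via --overrides (a sketch; the node name, the netshoot image, and the target pod IP/port are placeholders you need to replace with values from your own cluster):

# Pick the second node's name from `kubectl get nodes`, then pin a debug pod to it:
kubectl run debug --image=nicolaka/netshoot --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<second-node-name>"}}' \
  --command -- sleep 3600

# DNS lookup for an in-cluster service (times out on the affected clusters):
kubectl exec -it debug -- nslookup kubernetes.default.svc.cluster.local

# Direct pod-to-pod check against a pod running on the other node:
kubectl exec -it debug -- curl -m 5 http://<pod-ip-on-other-node>:<port>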

Another way to reproduce the issue is by running the Cilium connectivity test, which fails almost instantly on DNS resolution as well:

cilium connectivity test
ℹ️  Monitor aggregation detected, will skip some flow validation steps
ℹ️  Skipping tests that require a node Without Cilium
⌛ [management-arm] Waiting for deployment cilium-test/client to become ready...
⌛ [management-arm] Waiting for deployment cilium-test/client2 to become ready...
⌛ [management-arm] Waiting for deployment cilium-test/echo-same-node to become ready...
⌛ [management-arm] Waiting for deployment cilium-test/echo-other-node to become ready...
⌛ [management-arm] Waiting for CiliumEndpoint for pod cilium-test/client-6b4b857d98-dx9h2 to appear...
⌛ [management-arm] Waiting for CiliumEndpoint for pod cilium-test/client2-646b88fb9b-sh5mg to appear...
⌛ [management-arm] Waiting for pod cilium-test/client-6b4b857d98-dx9h2 to reach DNS server on cilium-test/echo-same-node-557b988b47-svd4w pod...
⌛ [management-arm] Waiting for pod cilium-test/client2-646b88fb9b-sh5mg to reach DNS server on cilium-test/echo-same-node-557b988b47-svd4w pod...
⌛ [management-arm] Waiting for pod cilium-test/client-6b4b857d98-dx9h2 to reach DNS server on cilium-test/echo-other-node-78455455d5-zrbvj pod...
connectivity test failed: timeout reached waiting for lookup for localhost from pod cilium-test/client-6b4b857d98-dx9h2 to server on pod cilium-test/echo-other-node-78455455d5-zrbvj to succeed (last error: context deadline exceeded)
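When the connectivity test fails like this, it can also help to look at the managed Cilium agents themselves (a sketch; it assumes the agent runs as the cilium DaemonSet in kube-system with the cilium and cilium-health CLIs available in the agent container, as it does on AKS clusters using the Cilium dataplane):

# List the agents and check their health and cross-node connectivity view:
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
kubectl -n kube-system exec ds/cilium -- cilium status --verbose
kubectl -n kube-system exec ds/cilium -- cilium-health status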

Expected behavior
Pods should be able to reach other services, as they do on x86-based nodes.

Environment (please complete the following information):

Additional context
I found out the culprit was the arm64 architecture by deploying a new cluster using the CLI for debugging purposes, and to my surprise, it worked seamlessly. This prompted a comparison between this new, temporary but functional cluster and my existing one.

It revealed a single difference: the vm_size, which was an x86 variant instead of the earlier deployed arm64. Subsequently, adjusting the vm_size to an x86 model on our existing cluster resulted in a functioning cluster again, where communication between pods worked flawlessly. But in my opinion, this is more of a workaround than a permanent solution.

Our intended workloads are optimised for arm; resorting back to x86 is a massive downgrade.
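For anyone hitting this in the meantime, a possible stopgap (not a fix) is to keep the cluster and add an x86 node pool next to the arm64 one; a sketch, where the pool name, VM size, and node count are only examples:

az aks nodepool add \
  -g $resourceGroup \
  --cluster-name arm-debugging \
  --name amd64pool \
  --node-count 2 \
  --node-vm-size Standard_D2s_v5 \
  --mode User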

bramvdklinkenberg commented 10 months ago

I tested this and hit the same issues. With Standard_D2s_v3 there are no problems at all. I also tried it without VNet integration, but got the same result.

# DOES NOT WORK!
az aks create -n aks-cilium-test-brammetje-2 \
  -g aks-rg \
  -l westeurope \
  --os-sku Ubuntu \
  --node-vm-size Standard_D2ps_v5 \
  --max-pods 250 \
  --node-count 3 \
  --network-plugin 'azure' \
  --network-dataplane 'cilium' \
  --network-plugin-mode overlay \
  --kubernetes-version "1.27.3" \
  --vnet-subnet-id "${SUBNET_2} \
  --generate-ssh-keys

# WORKS!
az aks create -n aks-cilium-test-brammetje-3 \
  -g aks-rg \
  -l westeurope \
  --os-sku Ubuntu \
  --node-vm-size Standard_D2s_v3 \
  --max-pods 250 \
  --node-count 3 \
  --network-plugin 'azure' \
  --network-dataplane 'cilium' \
  --network-plugin-mode overlay \
  --kubernetes-version "1.27.3" \
  --vnet-subnet-id "${SUBNET_3} \
  --generate-ssh-keys
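Since the only variable between the two commands above is the VM size, a quick sanity check is to look at what each node reports through the well-known labels (a sketch; the label names are standard Kubernetes conventions):

# Show each node's CPU architecture and VM size:
kubectl get nodes -L kubernetes.io/arch -L node.kubernetes.io/instance-type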
johnvanhienen commented 7 months ago

I've also been running into this issue for months now. Any update on the timeline for when this will be fixed?

palma21 commented 1 month ago

Apologies for the delay; this might have been an issue around its introduction, but it should be working now. Any issues you still see? (Do open a support ticket if it's something that's affecting you that shouldn't be.)

JoooostB commented 2 weeks ago

@bramvdklinkenberg Any chance you could test again? I don't have access to an active Azure tenant anymore.