CoreDNS timeout on vSphere cluster when resolving a service

ygao-armada commented 2 months ago

What happened: On an EKSA cluster for vSphere, we see a strange error. On a worker node, if we replace /etc/resolv.conf with the one from the pod argocd-server-xxx:

search argocd.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.192.10
options ndots:5

The nslookup command resolves the IP (10.96.221.1) first, then waits for 10 seconds until it times out:

root@mgmt20-md-0-7k7hk-vcnh2:/home/ec2-user# nslookup argocd-redis
Server:     10.96.192.10
Address:    10.96.192.10#53

Name:  argocd-redis.argocd.svc.cluster.local
Address: 10.96.221.1
;; connection timed out; no servers could be reached

root@mgmt20-md-0-7k7hk-vcnh2:/home/ec2-user# exit
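
For context, the fully qualified name in the nslookup output follows from the search list and ndots:5 in the resolv.conf above. Here is a minimal Python sketch of that expansion, assuming glibc-style resolver ordering; the helper names parse_resolv_conf and candidate_names are hypothetical and not part of any tool used in this issue:

```python
# Sketch only: approximates how a stub resolver orders candidate names.
# Assumes glibc-style behavior; helper names are hypothetical.

RESOLV_CONF = """\
search argocd.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.192.10
options ndots:5
"""


def parse_resolv_conf(text):
    """Extract search domains, nameservers, and ndots from resolv.conf text."""
    search, nameservers, ndots = [], [], 1  # ndots defaults to 1 when unset
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "search":
            search = parts[1:]
        elif parts[0] == "nameserver":
            nameservers.append(parts[1])
        elif parts[0] == "options":
            for opt in parts[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, nameservers, ndots


def candidate_names(name, search, ndots):
    """Return the names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified, no search list applied
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try the name as-is first
    return expanded + [name]  # too few dots: search list first


search, nameservers, ndots = parse_resolv_conf(RESOLV_CONF)
for candidate in candidate_names("argocd-redis", search, ndots):
    print(candidate)
```

Since "argocd-redis" contains no dots, the first candidate is argocd-redis.argocd.svc.cluster.local, which matches the name in the nslookup answer.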

We can see the IP (10.96.221.1) is correct as follows:

ubuntu@ubuntuguest:~$ kubectl get svc -A -o wide | grep 10.96.221.1
argocd               argocd-redis                         ClusterIP  10.96.221.1   <none>    6379/TCP            135m  app.kubernetes.io/name=argocd-redis

And 10.96.192.10 is the CoreDNS (kube-dns) service IP:

ubuntu@ubuntuguest:~$ kubectl get svc -A -o wide | grep 10.96.192.10
kube-system             kube-dns                           ClusterIP  10.96.192.10  <none>    53/UDP,53/TCP,9153/TCP     103d  k8s-app=kube-dns

Am I missing something?

What you expected to happen: The command "nslookup argocd-redis" should complete without a timeout.

How to reproduce it (as minimally and precisely as possible): Install Argo CD on an EKSA vSphere cluster, and follow the steps in the description above.
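
The output above does not show which query times out. Recent nslookup versions typically send both an A and an AAAA query for a name, so one way to narrow the reproduction is to time each query type separately against the kube-dns IP. The Python sketch below builds raw DNS queries over UDP. It assumes the server is reachable on port 53/UDP, and the helper names build_query and time_query are hypothetical:

```python
# Sketch only: times A and AAAA queries separately to see which one stalls.
# Helper names are hypothetical; nothing here is taken from the issue itself.
import socket
import struct
import time

QTYPE_A = 1
QTYPE_AAAA = 28


def build_query(name, qtype, query_id=0x1234):
    """Build a minimal DNS query packet for one question of class IN."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def time_query(server, name, qtype, timeout=5.0):
    """Send one UDP query; return (elapsed_seconds, got_reply)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name, qtype), (server, 53))
            sock.recvfrom(4096)
            return time.monotonic() - start, True
        except OSError:  # includes socket timeouts and unreachable errors
            return time.monotonic() - start, False


def report(server, name):
    """Print the timing for A and AAAA lookups of the same name."""
    for label, qtype in (("A", QTYPE_A), ("AAAA", QTYPE_AAAA)):
        elapsed, answered = time_query(server, name, qtype)
        status = "reply" if answered else "no reply"
        print(f"{label:5s} {status} after {elapsed:.2f}s")
```

On the worker node, calling report("10.96.192.10", "argocd-redis.argocd.svc.cluster.local") would show whether only one of the two query types goes unanswered.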

Anything else we need to know?:

Environment:

EKS Anywhere Release:
EKS Distro Release:

sp1999 commented 2 months ago

Thanks for reporting @ygao-armada. We are looking into this issue and will get back with any information we find.

ygao-armada commented 2 months ago

@sp1999 An update: I found that this is related to gpu-operator. It looks like the issue does not occur if we install argocd before gpu-operator. I installed argocd with:

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

And I installed gpu-operator following the instructions from: https://github.com/NVIDIA/gpu-operator/blob/release-23.9/scripts/install-gpu-operator-nvaie.sh

aws / eks-anywhere

CoreDNS timeout on vSphere cluster when resolving a service #8144