cloudnativelabs / kube-router

Kube-router, a turnkey solution for Kubernetes networking.
https://kube-router.io
Apache License 2.0

DSR not working with Rocky / RHEL: Unavailable desc = name resolver error: produced zero addresses #1754

Closed: opipenbe closed this issue 3 minutes ago

opipenbe commented 1 day ago

What happened?

DSR mode does not work on Rocky Linux / RHEL 9 with kube-router v2.2.0 or v2.2.1.

The kube-router logs show:

E1020 08:05:55.567390 372833 service_endpoints_sync.go:60] Error setting up IPVS services for service external IP's and load balancer IP's: failed to setup DSR endpoint 30.0.0.1: unable to setup DSR receiver inside pod: failed to prepare endpoint 192.168.6.78 to do DSR due to: rpc error: code = Unavailable desc = name resolver error: produced zero addresses

DSR works with kube-router v2.1.3 on Rocky Linux 9.4. The error above occurs with kube-router v2.2.0 and v2.2.1, so I believe a change between v2.1.3 and v2.2.0 introduced this incompatibility for DSR. I also tested DSR with Ubuntu 24.04 and kube-router v2.2.x in the same cluster, and it does not have this issue.

What did you expect to happen?

DSR mode enabled without errors for RHEL and its clones.

How can we reproduce the behavior you experienced?

Steps to reproduce the behavior:

  1. Install a kubeadm Kubernetes cluster without kube-proxy, using the CRI-O runtime.
  2. Deploy the latest kube-router v2.2.1 using kubeadm-kuberouter-all-features-dsr.yaml (https://github.com/cloudnativelabs/kube-router/blob/master/daemonset/kubeadm-kuberouter-all-features-dsr.yaml).
  3. Make the following changes in kubeadm-kuberouter-all-features-dsr.yaml:
    • set --runtime-endpoint=unix:///run/crio/crio.sock
    • replace /var/run/docker.sock with /run/crio/crio.sock in volumeMounts and volumes configuration
    • replace the kubeconfig volume:

      - name: kubeconfig
        configMap:
          name: kube-proxy
          items:
          - key: kubeconfig.conf
            path: kubeconfig

      with:

      - name: kubeconfig
        hostPath:
          path: /var/lib/kube-router

System Information (please complete the following information)

aauren commented 22 hours ago

This appears to have happened when we switched gRPC client construction from grpc.DialContext() to grpc.NewClient(). This fundamentally changed the default resolver from the passthrough resolver to the dns resolver.

We can see this as the first thing that DialContext() does when it enters the function: https://github.com/grpc/grpc-go/blob/98959d9a4904e98bbf8b423ce6a3cb5d36f90ee1/clientconn.go#L228

We probably need to force the passthrough resolver to fix this problem.
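
One way to force the passthrough resolver is to prefix the target with the passthrough:/// scheme before handing it to grpc.NewClient(). The sketch below is only an illustration and is not taken from #1756: the helper name passthroughTarget is hypothetical, and the socket path is assumed from the reproduction steps above.

```go
package main

import (
	"fmt"
	"strings"
)

// passthroughScheme is the gRPC resolver scheme that hands the address
// to the dialer unchanged, with no DNS lookup.
const passthroughScheme = "passthrough:///"

// passthroughTarget is a hypothetical helper (not kube-router code).
// It pins a target to the passthrough resolver unless it already is.
func passthroughTarget(addr string) string {
	if strings.HasPrefix(addr, "passthrough:") {
		return addr
	}
	return passthroughScheme + addr
}

func main() {
	// Assumed socket path, taken from the reproduction steps in this issue.
	target := passthroughTarget("/run/crio/crio.sock")
	fmt.Println(target)
	// The result would then be passed to grpc.NewClient(target, opts...),
	// typically together with a custom dialer (grpc.WithContextDialer)
	// that knows how to connect to the unix socket.
}
```

With the scheme spelled out, the resolver choice no longer depends on whether the client is created through DialContext() or NewClient().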

aauren commented 21 hours ago

@opipenbe can you try the fix on #1756 and let me know how it works for you?

opipenbe commented 13 hours ago

Thank you @aauren! I just built an image from #1756 and it resolved this issue.