coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

Reverse DNS look-ups are inconsistent #2160

Open benfuu opened 4 years ago

benfuu commented 4 years ago

I have deployed etcd-operator with helm and have the following cluster spec:

apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: coredns-etcd-cluster
spec:
  size: 3

As I understand it from the documentation, etcd with TLS enabled does a reverse lookup on the IP address of the connecting etcd pod to check whether the incoming request is valid. However, when I run nslookup <PEER_IP_ADDR> from an etcd pod, I get inconsistent results:

/ # nslookup 10.11.3.99
nslookup: can't resolve '(null)': Name does not resolve

Name:      10.11.3.99
Address 1: 10.11.3.99 10-11-3-99.coredns-etcd-cluster-client.dns.svc.cluster.local
/ # nslookup 10.11.3.99
nslookup: can't resolve '(null)': Name does not resolve

Name:      10.11.3.99
Address 1: 10.11.3.99 coredns-etcd-cluster-t9rjxhtc96.coredns-etcd-cluster.dns.svc.cluster.local

Half the time, the reverse lookup returns the client service DNS name, of the form <pod-ip>.coredns-etcd-cluster-client.*, which is the wrong one. This causes peer TLS communication to fail, since that name is not of the form *.coredns-etcd-cluster.*.
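To illustrate why the client-service name fails, here is a minimal sketch of a wildcard check in Python. It assumes the reverse-lookup name is compared against wildcard certificate entries by simple suffix matching. The helper name `ptr_matches_wildcards` and the `SANS` list are hypothetical, modelled on the `*.coredns-etcd-cluster.*` shape above. This is not etcd's actual code.

```python
def ptr_matches_wildcards(ptr_name, dns_names):
    """Return True if ptr_name falls under any wildcard entry in dns_names.

    Hypothetical helper: suffix matching is an assumption about how a
    wildcard certificate name is compared with a reverse-lookup result.
    """
    name = ptr_name.rstrip(".").lower()  # PTR answers often end with a dot
    for entry in dns_names:
        if entry.startswith("*."):
            suffix = entry[1:].lower()  # "*.svc" becomes ".svc"
            if name.endswith(suffix):
                return True
    return False


# Assumed certificate names for the cluster in this issue (not taken from a real cert).
SANS = [
    "*.coredns-etcd-cluster.dns.svc",
    "*.coredns-etcd-cluster.dns.svc.cluster.local",
]

# The two answers nslookup returned for 10.11.3.99.
peer = "coredns-etcd-cluster-t9rjxhtc96.coredns-etcd-cluster.dns.svc.cluster.local"
client = "10-11-3-99.coredns-etcd-cluster-client.dns.svc.cluster.local"

print(ptr_matches_wildcards(peer, SANS))    # True
print(ptr_matches_wildcards(client, SANS))  # False
```

Under that assumption, whichever name the resolver happens to return decides whether the peer connection is accepted, which would explain intermittent rejections.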

I first discovered this on a newly created k8s cluster (v1.17.2) when trying to deploy Cilium with managed etcd. Cilium internally uses etcd-operator to create its etcd cluster, and I saw the etcd pod logs flooded with these messages:

2020-02-14 03:29:44.693313 I | embed: rejected connection from "10.11.4.148:53696" (error "tls: \"10.11.4.148\" does not match any of DNSNames [\"*.cilium-etcd.kube-system.svc\" \"*.cilium-etcd.kube-system.svc.cluster.local\"]", ServerName "cilium-etcd-8svdg9rhbc.cilium-etcd.kube-system.svc", IPAddresses [], DNSNames ["*.cilium-etcd.kube-system.svc" "*.cilium-etcd.kube-system.svc.cluster.local"])

So I created my own etcd-operator deployment and confirmed that, from one etcd pod, repeated reverse lookups of a peer etcd pod's IP address return different values.
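For anyone trying to reproduce this, the repeated lookup can be scripted. The sketch below assumes a Python interpreter is available somewhere with the same DNS configuration as the etcd pod, which may not be true of the etcd image itself. `tally_reverse_lookups` is a hypothetical helper. It counts how often each name comes back and uses the standard `socket.gethostbyaddr` call by default.

```python
import itertools
import socket
from collections import Counter


def tally_reverse_lookups(ip, attempts=20, resolve=None):
    """Reverse-resolve ip several times and count each distinct answer."""
    if resolve is None:
        # socket.gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        resolve = lambda addr: socket.gethostbyaddr(addr)[0]
    counts = Counter()
    for _ in range(attempts):
        try:
            counts[resolve(ip)] += 1
        except OSError as exc:  # socket.herror and socket.gaierror derive from OSError
            counts["error: %s" % exc] += 1
    return counts


# Demo with a stub resolver that alternates between the two names seen in
# this issue; against a real cluster, call tally_reverse_lookups("10.11.3.99").
stub_answers = itertools.cycle([
    "10-11-3-99.coredns-etcd-cluster-client.dns.svc.cluster.local",
    "coredns-etcd-cluster-t9rjxhtc96.coredns-etcd-cluster.dns.svc.cluster.local",
])
for name, n in tally_reverse_lookups(
    "10.11.3.99", attempts=4, resolve=lambda ip: next(stub_answers)
).most_common():
    print(n, name)
```

More than one key in the result for a single peer IP would indicate the inconsistency described here.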

The only time that the reverse DNS lookup is consistent is when the pod is looking up its own DNS name since it is written into /etc/hosts.

Can somebody please help investigate to see if they can replicate this and if this issue lies with how etcd-operator is creating the etcd pods?

bmcustodio commented 4 years ago

I am also facing this issue (although I always get the same, but "wrong", result for reverse lookups). I have another cluster running an older version of CoreDNS (1.3.1) where this issue never happened, so I thought it might be related to the version of CoreDNS in use (1.6.6). It seems that CoreDNS >= 1.6.0 exhibits this behaviour, while CoreDNS <= 1.5.2 does not:

CoreDNS 1.5.2:

/ # host 10.150.27.157
157.27.150.10.in-addr.arpa domain name pointer cilium-etcd-fsbmhdkzgk.cilium-etcd.cilium.svc.cluster.local.

CoreDNS 1.6.0:

/ # host 10.150.27.157
157.27.150.10.in-addr.arpa domain name pointer 10-150-27-157.cilium-etcd-client.cilium.svc.cluster.local.
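The two outputs above differ only in the PTR target. As a small sketch (IPv4 only, helper names hypothetical), the following Python builds the in-addr.arpa query name with the standard `ipaddress` module and flags a target whose first label is just the IP written with dashes, as in the 1.6.0 output.

```python
import ipaddress


def reverse_query_name(ip):
    """Name that a PTR query for ip asks about."""
    return ipaddress.ip_address(ip).reverse_pointer


def is_ip_derived_name(ip, ptr_target):
    """True if the first label of ptr_target is the IP with dots replaced by dashes."""
    first_label = ptr_target.rstrip(".").split(".", 1)[0]
    return first_label == ip.replace(".", "-")


ip = "10.150.27.157"
print(reverse_query_name(ip))  # 157.27.150.10.in-addr.arpa

# PTR targets copied from the two `host` outputs above.
old = "cilium-etcd-fsbmhdkzgk.cilium-etcd.cilium.svc.cluster.local."
new = "10-150-27-157.cilium-etcd-client.cilium.svc.cluster.local."
print(is_ip_derived_name(ip, old))  # False
print(is_ip_derived_name(ip, new))  # True
```

A check like this could be run against each etcd pod IP to see which kind of name a given CoreDNS version hands back.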