coreos / coreos-kubernetes

CoreOS Container Linux+Kubernetes documentation & Vagrant installers
https://coreos.com/kubernetes/docs/latest/
Apache License 2.0

kube-aws: single point of failure in kube-dns #641

Open danielfm opened 8 years ago

danielfm commented 8 years ago

As of now, we are running just a single kube-dns replica.

Isn't this a problem? If the kube-dns pod fails for some reason, all DNS resolution requests will fail until that pod is restarted (or worse, it might take even longer to get the kube-dns pod up if it gets rescheduled on a different node due to resource starvation or something like that).

That being said, wouldn't it be best to start with at least two replicas there?

aaronlevy commented 8 years ago

So far we've left this to an operator to decide, and it should be as simple as:

kubectl scale rc kube-dns --replicas=X

I feel like this is still important for an operator to be aware of, as "having X copies" still doesn't mean you're necessarily HA. Consider a single-node cluster where we launch 2 DNS pods: both will end up scheduled on the same machine. If we then add machines to the cluster, the scheduler will not reschedule those pods to spread them across the new nodes (although re-scheduling is something being discussed upstream).
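For illustration, a pod anti-affinity stanza along these lines would keep two DNS replicas off the same node. This is just a sketch, assuming a Kubernetes version where pod anti-affinity is available as a first-class field and reusing the stock k8s-app: kube-dns label; it is not part of the current kube-aws manifests:

        # Sketch only: pod anti-affinity for the kube-dns pod template, assuming
        # a Kubernetes release that supports the podAntiAffinity field.
        spec:
          template:
            spec:
              affinity:
                podAntiAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    - labelSelector:
                        matchLabels:
                          k8s-app: kube-dns
                      topologyKey: kubernetes.io/hostname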

Maybe this would be best covered in some form of "operators" documentation?

danielfm commented 8 years ago

So far we've left this to an operator to decide, and it should be as simple as:

kubectl scale rc kube-dns --replicas=X

That's what I did in my cluster, besides changing userdata/cloud-config-controller so that I don't forget about this in future updates, of course. 😄

I feel like this is still important for an operator to be aware of, as "having X copies" still doesn't mean you're necessarily HA.

That's exactly why I opened this issue. I hang out in #kubernetes-users a lot, and it's rather common to see people running into problems because they're not using enough replicas there.

Since kube-aws "just works", people don't seem to be aware that DNS is a hugely important component of a Kubernetes cluster, and that things will get ugly if they forget about it as their workload increases.

Consider a single-node cluster where we launch 2 DNS pods: both will end up scheduled on the same machine. If we then add machines to the cluster, the scheduler will not reschedule those pods to spread them across the new nodes (although re-scheduling is something being discussed upstream).

That is a nasty corner case indeed, although in most real-world use cases people will create multi-node multi-AZ clusters, and the scheduler will most likely spread those pods in a way that this does not happen.

Since having multiple DNS replicas for single-node clusters does not make much sense, we might specify a sane default depending on the number of workers.

(I thought about coming up with a way to calculate the optimal number of replicas based on the cluster size, but this is not possible because it very much depends on a number of factors beyond the number of nodes.)
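For what it's worth, the upstream cluster-proportional-autoscaler add-on expresses exactly this kind of node-count-based heuristic. A sketch of its linear-mode ConfigMap, assuming that add-on were deployed alongside kube-dns (the numbers are illustrative, not a recommendation):

        # Sketch of a node/core-proportional scaling policy for kube-dns, assuming
        # the cluster-proportional-autoscaler add-on; the values are illustrative.
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: kube-dns-autoscaler
          namespace: kube-system
        data:
          linear: |-
            {"coresPerReplica": 256, "nodesPerReplica": 16, "min": 2, "preventSinglePointFailure": true}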

Maybe this would be best covered in some form of "operators" documentation?

Regardless of the solution we choose to implement (if any), explaining the importance of the cluster DNS service for healthy cluster operation is a must.

cgag commented 8 years ago

I think upping it to at least 2 is a good idea. Making it a daemonset seems even better to me, since that avoids all copies being scheduled on one node. I don't really like the idea of leaving it at 1 replica and leaving scaling as an exercise to the operator.

It looks like making it a daemonset has been discussed upstream: https://github.com/kubernetes/kubernetes/issues/26707

We've also had a kube-aws user run into DNS lookups as a bottleneck when only running one kube-dns pod: https://github.com/coreos/coreos-kubernetes/issues/533

Thermi commented 8 years ago

I did this on the cluster I'm operating. The required changes are: make the RC a DS, change the apiVersion to the one that supports DaemonSets, and use the "app" selector in the DS and the SVC instead of "k8s-app", since "k8s-app" is not supported in the apiVersion that supports the DS.

Definition of the DS:

        {
          "apiVersion": "extensions/v1beta1",
          "kind": "DaemonSet",
          "metadata": {
            "name": "kube-dns-v15",
            "namespace": "kube-system",
            "labels": {
              "app": "kube-dns",
              "version": "v15",
              "kubernetes.io/cluster-service": "true"
            }
          },
          "spec": {
            "template": {
              "metadata": {
                "labels": {
                  "app": "kube-dns",
                  "version": "v15",
                  "kubernetes.io/cluster-service": "true"
                }
              },
              "spec": {
                "containers": [
                  {
                    "name": "kubedns",
                    "image": "gcr.io/google_containers/kubedns-amd64:1.7",
                    "resources": {
                      "limits": {
                        "cpu": "100m",
                        "memory": "200Mi"
                      },
                      "requests": {
                        "cpu": "100m",
                        "memory": "100Mi"
                      }
                    },
                    "livenessProbe": {
                      "httpGet": {
                        "path": "/healthz",
                        "port": 8080,
                        "scheme": "HTTP"
                      },
                      "initialDelaySeconds": 60,
                      "timeoutSeconds": 5,
                      "successThreshold": 1,
                      "failureThreshold": 5
                    },
                    "readinessProbe": {
                      "httpGet": {
                        "path": "/readiness",
                        "port": 8081,
                        "scheme": "HTTP"
                      },
                      "initialDelaySeconds": 30,
                      "timeoutSeconds": 5
                    },
                    "args": [
                      "--domain=cluster.local.",
                      "--dns-port=10053"
                    ],
                    "ports": [
                      {
                        "containerPort": 10053,
                        "name": "dns-local",
                        "protocol": "UDP"
                      },
                      {
                        "containerPort": 10053,
                        "name": "dns-tcp-local",
                        "protocol": "TCP"
                      }
                    ]
                  },
                  {
                    "name": "dnsmasq",
                    "image": "gcr.io/google_containers/dnsmasq:1.1",
                    "args": [
                      "--cache-size=1000",
                      "--no-resolv",
                      "--server=127.0.0.1#10053"
                    ],
                    "ports": [
                      {
                        "containerPort": 53,
                        "name": "dns",
                        "protocol": "UDP"
                      },
                      {
                        "containerPort": 53,
                        "name": "dns-tcp",
                        "protocol": "TCP"
                      }
                    ]
                  },
                  {
                    "name": "healthz",
                    "image": "gcr.io/google_containers/exechealthz-amd64:1.0",
                    "resources": {
                      "limits": {
                        "cpu": "10m",
                        "memory": "20Mi"
                      },
                      "requests": {
                        "cpu": "10m",
                        "memory": "20Mi"
                      }
                    },
                    "args": [
                      "-cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null",
                      "-port=8080"
                    ],
                    "ports": [
                      {
                        "containerPort": 8080,
                        "protocol": "TCP"
                      }
                    ]
                  }
                ],
                "dnsPolicy": "Default"
              }
            }
          }
        }

Definition of the SVC:

        {
          "apiVersion": "v1",
          "kind": "Service",
          "metadata": {
            "labels": {
              "app": "kube-dns",
              "kubernetes.io/cluster-service": "true",
              "kubernetes.io/name": "KubeDNS"
            },
            "name": "kube-dns",
            "namespace": "kube-system"
          },
          "spec": {
            "clusterIP": "{{.DNSServiceIP}}",
            "ports": [
              {
                "name": "dns",
                "port": 53,
                "protocol": "UDP"
              },
              {
                "name": "dns-tcp",
                "port": 53,
                "protocol": "TCP"
              }
            ],
            "selector": {
              "app": "kube-dns"
            }
          }
        }

dzavalkinolx commented 8 years ago

@aaronlevy Are you going to switch from a replication controller to a daemonset soon? We have faced this issue a couple of times already: if the controller node is down or there is some issue with it, the cluster doesn't work. From our point of view, a kube-dns daemonset is an essential part of Kubernetes HA. @Thermi posted a solution above; I can create a pull request out of it if that works better for you. DNS HA should be addressed as soon as possible.

sutinski commented 7 years ago

Let's consider a scenario: 3 kube-dns pods are running as part of a DaemonSet on 3 Kubernetes master nodes. The kube-dns service distributes load using the simple round-robin algorithm implemented in iptables, and the IP address of the kube-dns service is in each container's resolv.conf file.

1) Imagine that 1 kube-dns pod fails for any reason. iptables will continue to distribute DNS requests among the 3 pod IP addresses, which means that every third DNS request won't be served and the client will see a request timeout. Even if a health check is implemented inside the kube-dns pod, 1/3 of DNS requests still won't be served for as long as healthcheck_timeout + pod_restart_time.

2) Consider an application that uses the Unix getaddrinfo() function for name resolution. The standard behavior of this function is to try once per DNS server entry in resolv.conf. With a single resolv.conf entry pointing to the kube-dns service, getaddrinfo() will fail with 1/3 probability if one out of 3 kube-dns pods is not responding. The rest depends on application logic: some apps may retry and continue, others (in our case pgpool2) fail a database cluster node if it cannot resolve the node's IP address.

This could be resolved by using a secondary DNS entry in resolv.conf. Unfortunately, the current design of the DNS subsystem assumes only one DNS authority capable of resolving internal names.
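For reference, with the stock add-on a container's /etc/resolv.conf contains only that single service IP, so there is no second nameserver for the resolver to fall back to. A sketch (the address and search path are illustrative, standing in for {{.DNSServiceIP}} and the pod's namespace):

        # Illustrative resolv.conf as written by kubelet for a pod in the default
        # namespace; 10.3.0.10 is an assumed value for {{.DNSServiceIP}}.
        nameserver 10.3.0.10
        search default.svc.cluster.local svc.cluster.local cluster.local
        options ndots:5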

jianzi123 commented 7 years ago

@sergei-utinski Could you explain "Unfortunately, the current design of the DNS subsystem assumes only one DNS authority capable of resolving internal names."? Our team plans to start 2 DNS servers via kubelet, and the sentence above puzzles me...

sutinski commented 7 years ago

@jianzi123: It all depends on your DNS setup and the way you configure your containers. If you have 2 independent DNS servers capable of resolving internal cluster names (in the svc.cluster.local namespace) and instruct kubelet to pass both of them to your containers, then your resolv.conf will have at least 2 IPs to resolve internal names, and getaddrinfo() will retry with the second IP if the first fails.
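A sketch of the resulting resolv.conf in that setup (both addresses are placeholders, not values shipped by this repository):

        # Illustrative resolv.conf when kubelet is given two cluster DNS IPs;
        # getaddrinfo() can fall back to the second entry if the first fails.
        nameserver 10.3.0.10
        nameserver 10.3.0.11
        search default.svc.cluster.local svc.cluster.local cluster.local
        options ndots:5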

In my comment above I was considering the traditional kube-dns add-on configuration, where several DNS containers (kube-dns pods) are hidden behind a single service IP and this IP is the only one present in a container's resolv.conf. In this case, if a DNS request hits a dead container, no retries happen and the query fails, regardless of the number of still-healthy DNS containers running.

jianzi123 commented 7 years ago

@sergei-utinski Thanks for your answer. I added some code to start 2 independent DNS servers with 2 IPs, and it works.

chhuang0123 commented 7 years ago

@jianzi123: could you provide your yaml file for reference?

I am not sure whether my understanding is correct.

Does it mean that we could create 2 kube-dns services, for example 1.2.3.4 and 1.2.3.5 (each service backed by 2 pods...), and then pass both DNS server IPs to kubelet via --cluster-dns=1.2.3.4,1.2.3.5?

jianzi123 commented 7 years ago

sorry, i forgot this issue... skydns-svc.yaml:

        apiVersion: v1
        kind: Service
        metadata:
          name: kube-dns1
          namespace: kube-system
          labels:
            k8s-app: kube-dns1
            kubernetes.io/cluster-service: "true"
            kubernetes.io/name: "KubeDNS"
        spec:
          selector:
            k8s-app: kube-dns1
          clusterIP: 10.254.100.101
          ports: