cncf / demo

Demo of CNCF technologies
https://cncf.io
Apache License 2.0

Intermittent responses for Kubernetes service endpoints (postmortem) #103

Closed namliz closed 7 years ago

namliz commented 7 years ago

Follow up to #63.

Beginning of Problems

At some point in time a known-good deployment stopped succeeding on newly created clusters. This was caused by several disparate issues across several versions/configurations/components.

The first step in checking whether a service is working correctly is a simple DNS check (nslookup <service-name>). By chance, this would often appear to work as expected, suggesting the problem must be elsewhere (and not necessarily with Kubernetes at all).

However, not to bury the lede: running nslookup in a loop would later expose that it was timing out sporadically. That is the sort of thing that makes a bug sinister, as it misdirects debugging efforts away from the actual problem.
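A minimal sketch of such a loop (run from inside a pod; the service name my-service is a placeholder):

# Repeat the lookup; intermittent timeouts surface quickly even though any
# single lookup may happen to succeed.
for i in $(seq 1 100); do
  nslookup my-service >/dev/null 2>&1 || echo "$(date +%T) lookup $i failed"
  sleep 1
done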


Known KubeDNS Issues Encountered

These problems would crop up and get resolved, yet errors stubbornly persisted.

kubectl logs $(kubectl --namespace=kube-system get pods | tail -n1 | cut -d' ' -f1) --namespace=kube-system --container kubedns

I0829 20:19:21.696107       1 server.go:94] Using https://10.16.0.1:443 for kubernetes master, kubernetes API: <nil>
I0829 20:19:21.699491       1 server.go:99] v1.4.0-alpha.2.1652+c69e3d32a29cfa-dirty
I0829 20:19:21.699518       1 server.go:101] FLAG: --alsologtostderr="false"
I0829 20:19:21.699536       1 server.go:101] FLAG: --dns-port="10053"
I0829 20:19:21.699548       1 server.go:101] FLAG: --domain="cluster.local."
I0829 20:19:21.699554       1 server.go:101] FLAG: --federations=""
I0829 20:19:21.699560       1 server.go:101] FLAG: --healthz-port="8081"
I0829 20:19:21.699565       1 server.go:101] FLAG: --kube-master-url=""
I0829 20:19:21.699571       1 server.go:101] FLAG: --kubecfg-file=""
I0829 20:19:21.699577       1 server.go:101] FLAG: --log-backtrace-at=":0"
I0829 20:19:21.699584       1 server.go:101] FLAG: --log-dir=""
I0829 20:19:21.699600       1 server.go:101] FLAG: --log-flush-frequency="5s"
I0829 20:19:21.699607       1 server.go:101] FLAG: --logtostderr="true"
I0829 20:19:21.699613       1 server.go:101] FLAG: --stderrthreshold="2"
I0829 20:19:21.699618       1 server.go:101] FLAG: --v="0"
I0829 20:19:21.699622       1 server.go:101] FLAG: --version="false"
I0829 20:19:21.699629       1 server.go:101] FLAG: --vmodule=""
I0829 20:19:21.699681       1 server.go:138] Starting SkyDNS server. Listening on port:10053
I0829 20:19:21.699729       1 server.go:145] skydns: metrics enabled on : /metrics:
I0829 20:19:21.699751       1 dns.go:167] Waiting for service: default/kubernetes
I0829 20:19:21.700458       1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0829 20:19:21.700474       1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0829 20:19:26.691900       1 logs.go:41] skydns: failure to forward request "read udp 10.32.0.2:49468->172.20.0.2:53: i/o timeout"
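The timeout above is kube-dns failing to forward a query to the upstream resolver (172.20.0.2:53). A quick sanity check, assuming dig is available on the node, is to query that resolver directly:

# A timeout or SERVFAIL here points at the host/VPC DNS path rather than kube-dns itself.
$ dig @172.20.0.2 kubernetes.io +time=2 +tries=1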

Known Kubernetes Networking Issues Encountered

Initial Checks

Kubernetes imposes the following fundamental requirements on any networking implementation:

  • all containers can communicate with all other containers without NAT
  • all nodes can communicate with all containers (and vice-versa) without NAT
  • the IP that a container sees itself as is the same IP that others see it as

- Networking in Kubernetes

In other words, to make sure networking is not seriously broken or misconfigured, check that each of those requirements actually holds: pods can reach pods, nodes can reach pods (and vice versa), and the IP a pod sees for itself matches what others see.

At first blush these checks looked fine, but pod creation was sluggish (30-60 seconds), and that is a red flag.
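For concreteness, the checks amount to something like the following (a sketch; pod names and IPs are placeholders to be replaced with the values reported by kubectl, and the test image is assumed to ship ping and ip):

$ kubectl get pods -o wide                     # note each pod's IP
$ kubectl exec pod-a -- ping -c 3 10.244.0.5   # pod-to-pod, no NAT
$ ping -c 3 10.244.0.4                         # node-to-pod, run on a minion
$ kubectl exec pod-a -- ip addr show eth0      # the IP the pod sees for itself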

Missing Dependencies

As described in #62, at some version binaries went missing from the CNI folder.

More undocumented dependencies (#64) were found by staring at logs and noting weirdness. The really important ones are conntrack-tools, socat, and bridge-utils; these are now being pinned down upstream.
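On the CentOS 7 image used here, pre-installing them is straightforward (a sketch; these are the package names in the stock repositories):

$ sudo yum install -y conntrack-tools socat bridge-utils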

The errors were time-consuming to understand because their phrasing often left something to be desired. Unfortunately there's at least one known false-positive warning (kubernetes/kubernetes#23385).

Cluster CIDR overlaps

--cluster-cidr="": CIDR Range for Pods in cluster.
--service-cluster-ip-range="": CIDR Range for Services in cluster.

In my case services got a /16 starting at 10.0.0.0 and the cluster-cidr got a /16 at 10.244.0.0. The service CIDR is routable because kube-proxy is constantly writing iptables rules on every minion.

For Weave in particular, --ipalloc-range needs to be passed and must exactly match what is given to the Kubernetes cluster-cidr.

Whatever your network overlay, it must not clobber the service range!
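For reference, a sketch of how the ranges line up in this setup (values taken from the description above; adjust to your own addressing plan):

# kube-apiserver: the service VIP range
--service-cluster-ip-range=10.0.0.0/16
# kube-controller-manager and kube-proxy: the pod range
--cluster-cidr=10.244.0.0/16
# Weave must allocate out of the same pod range
$ weave launch --ipalloc-range 10.244.0.0/16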

Iptables masquerade conflicts

Flannel

If using Flannel, be sure to follow the newly documented instructions: DOCKER_OPTS="--iptables=false --ip-masq=false"

Kube-proxy makes extensive use of masquerading rules. Much like an overlay clobbering the service range, another component (such as the Docker daemon itself) rewriting masquerade rules will cause unexpected behavior.
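A quick way to spot this class of conflict is to look at who owns the masquerade rules in the nat table on a minion (a sketch; 172.17.0.0/16 is Docker's default bridge subnet):

$ iptables -t nat -S POSTROUTING | grep MASQUERADE
# kube-proxy and the overlay should be the only sources of MASQUERADE rules;
# a rule for the Docker bridge (e.g. -s 172.17.0.0/16 ! -o docker0) means the
# daemon is still running with --ip-masq=true.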

Weave

Weave was originally, and erroneously, started with --docker-endpoint=unix:///var/run/weave/weave.sock, which similarly caused unexpected behavior. This flag is extraneous and must be omitted when Weave is used with CNI.

Final Configuration

Image

CentOS 7 (source_ami: ami-bec022de)

Dependencies

SELinux disabled.

Yum installed:

kubernetes_version: 1.4.0-alpha.3 (b44b716965db2d54c8c7dfcdbcb1d54792ab8559)

weave_version: 1.6.1

1 Master (172.20.0.78)

A gist of the journalctl output shows the master boots fine: docker, etcd, kube-apiserver, the scheduler, and the controller-manager all start. The minion registers successfully.

$ kubectl get componentstatuses

NAME                 STATUS    MESSAGE              ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health": "true"}
$ kubectl get nodes 

NAME                                        STATUS    AGE
ip-172-20-0-18.us-west-2.compute.internal   Ready     1m

1 minion (172.20.0.18)

$ kubectl run -i --tty --image concourse/busyboxplus:curl dns-test42-$RANDOM --restart=Never /bin/sh

The pod is created promptly (not sluggishly), and multiple pods can ping each other.
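Since the test image ships curl, another quick check from inside that pod is to hit the kubernetes service repeatedly and watch for stalls (an unauthenticated request will typically get a 401/403, which is fine here; the point is that it answers promptly every time):

$ for i in $(seq 1 20); do curl -ks -o /dev/null -w "%{http_code}\n" --max-time 2 https://kubernetes.default; done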

Weave

Weave and weaveproxy are up and running just fine.

$ weave status

Version: 1.6.0 (version 1.6.1 available - please upgrade!)

        Service: router
       Protocol: weave 1..2
           Name: ce:1a:4b:b0:07:6d(ip-172-20-0-18)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 0
    Connections: 0
          Peers: 1
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.244.0.0/16
  DefaultSubnet: 10.244.0.0/16

        Service: proxy
        Address: unix:///var/run/weave/weave.sock
$ weave status ipam

ce:1a:4b:b0:07:6d(ip-172-20-0-18)        65536 IPs (100.0% of total)

Conclusion

Kubernetes is rapidly evolving with many open issues -- there are now efforts upstream to pin down and document the dependencies, and to make errors and warnings in the logs more user-friendly.

As future versions become less opaque, it will become easier to know which open issue is relevant to your setup, whether an obvious dependency is missing, and what a good setup looks like.

The nominal sanity check command that currently exists (kubectl get componentstatuses) does not go far enough. It might show everything is healthy. Pods might be successfully created. Services might work.

And yet all of these can be misleading, as the cluster may still not be entirely healthy.

A useful test I found in the official repo simply tests connectivity (and authentication) to the master. Sluggishness is not tested, and sluggishness, it turns out, is a red flag.

In fact, there's an entire folder of these, but they are not well documented as far as I can tell.

I believe a smoke test that can be deployed against any running cluster to run through a suite of checks and benchmarks (taking unexpectedly poor performance into account) would significantly improve the debugging experience.
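As a rough illustration of what such a smoke test might check, here is a sketch (not an existing tool; the pod name, image, and thresholds are illustrative only):

#!/bin/bash
# smoke.sh -- crude cluster smoke test: time pod creation, then loop DNS.
name=smoke-$RANDOM

start=$(date +%s)
kubectl run "$name" --image=busybox --restart=Never -- sleep 300
until kubectl get pod "$name" -o jsonpath='{.status.phase}' | grep -q Running; do
  sleep 1
done
echo "pod ready in $(( $(date +%s) - start ))s (anything approaching 30s is a red flag)"

# DNS should answer every single time, not merely most of the time.
fails=0
for i in $(seq 1 50); do
  kubectl exec "$name" -- nslookup kubernetes.default >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "DNS lookup failures: $fails/50"

kubectl delete pod "$name" >/dev/null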

dankohn commented 7 years ago

This is a great start. Could you please do another thorough edit, and then I'll give my comments over the phone Monday or Tuesday on where you're leaving out the story.

dankohn commented 7 years ago

In particular, on your reread, please look at the following things, which are quite distracting:

dankohn commented 7 years ago

Template: I expected x, I got y, based on url, I did z to fix the problem.

leecalcote commented 7 years ago

Possible other template sections:

  1. Expected Behavior
  2. Actual Behavior
  3. Steps to Reproduce
  4. Resolution / Fix
  5. Related Issues