cncf / demo

Demo of CNCF technologies
https://cncf.io
Apache License 2.0

Intermittent responses for Kubernetes service endpoints (postmortem) #103

Closed namliz closed 7 years ago

namliz commented 7 years ago

Follow up to #63.

Beginning of Problems

At some point in time a known-good deployment stopped succeeding on newly created clusters. This was caused by several disparate issues across several versions/configurations/components.

The first step in checking whether a service is working correctly is a simple DNS check (nslookup <service-name>). By chance, this would often appear to work as expected, suggesting the problem must be elsewhere (and not necessarily with Kubernetes at all).

However, not to bury the lede: running nslookup in a loop would later expose that it was timing out sporadically. That is the sort of thing that makes a bug sinister, as it misdirects debugging efforts away from the actual problem.
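A minimal sketch of such a loop (run from inside a pod; the service name my-service is a placeholder):

# Repeat the lookup; intermittent timeouts surface quickly even though any
# single lookup may happen to succeed.
for i in $(seq 1 100); do
  nslookup my-service >/dev/null 2>&1 || echo "$(date +%T) lookup $i failed"
  sleep 1
done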


Known KubeDNS Issues Encountered

These problems would crop up and get resolved, yet errors stubbornly persisted.

kubectl logs $(kubectl --namespace=kube-system get pods | tail -n1 | cut -d' ' -f1) --namespace=kube-system --container kubedns

I0829 20:19:21.696107       1 server.go:94] Using https://10.16.0.1:443 for kubernetes master, kubernetes API: <nil>
I0829 20:19:21.699491       1 server.go:99] v1.4.0-alpha.2.1652+c69e3d32a29cfa-dirty
I0829 20:19:21.699518       1 server.go:101] FLAG: --alsologtostderr="false"
I0829 20:19:21.699536       1 server.go:101] FLAG: --dns-port="10053"
I0829 20:19:21.699548       1 server.go:101] FLAG: --domain="cluster.local."
I0829 20:19:21.699554       1 server.go:101] FLAG: --federations=""
I0829 20:19:21.699560       1 server.go:101] FLAG: --healthz-port="8081"
I0829 20:19:21.699565       1 server.go:101] FLAG: --kube-master-url=""
I0829 20:19:21.699571       1 server.go:101] FLAG: --kubecfg-file=""
I0829 20:19:21.699577       1 server.go:101] FLAG: --log-backtrace-at=":0"
I0829 20:19:21.699584       1 server.go:101] FLAG: --log-dir=""
I0829 20:19:21.699600       1 server.go:101] FLAG: --log-flush-frequency="5s"
I0829 20:19:21.699607       1 server.go:101] FLAG: --logtostderr="true"
I0829 20:19:21.699613       1 server.go:101] FLAG: --stderrthreshold="2"
I0829 20:19:21.699618       1 server.go:101] FLAG: --v="0"
I0829 20:19:21.699622       1 server.go:101] FLAG: --version="false"
I0829 20:19:21.699629       1 server.go:101] FLAG: --vmodule=""
I0829 20:19:21.699681       1 server.go:138] Starting SkyDNS server. Listening on port:10053
I0829 20:19:21.699729       1 server.go:145] skydns: metrics enabled on : /metrics:
I0829 20:19:21.699751       1 dns.go:167] Waiting for service: default/kubernetes
I0829 20:19:21.700458       1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0829 20:19:21.700474       1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0829 20:19:26.691900       1 logs.go:41] skydns: failure to forward request "read udp 10.32.0.2:49468->172.20.0.2:53: i/o timeout"
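The timeout above is kube-dns failing to forward a query to the upstream resolver (172.20.0.2:53). A quick sanity check, assuming dig is available on the node, is to query that resolver directly:

# A timeout or SERVFAIL here points at the host/VPC DNS path rather than kube-dns itself.
$ dig @172.20.0.2 kubernetes.io +time=2 +tries=1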

Known Kubernetes Networking Issues Encountered

Initial Checks

Kubernetes imposes the following fundamental requirements on any networking implementation:

  • all containers can communicate with all other containers without NAT
  • all nodes can communicate with all containers (and vice-versa) without NAT
  • the IP that a container sees itself as is the same IP that others see it as

- Networking in Kubernetes

In other words, to make sure networking is not seriously broken or misconfigured, check that each of those requirements actually holds: pods can reach pods, nodes can reach pods (and vice versa), and the IP a pod sees for itself matches what others see.

At first blush these checks looked fine, but pod creation was sluggish (30-60 seconds), and that is a red flag.
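For concreteness, the checks amount to something like the following (a sketch; pod names and IPs are placeholders to be replaced with the values reported by kubectl, and the test image is assumed to ship ping and ip):

$ kubectl get pods -o wide                     # note each pod's IP
$ kubectl exec pod-a -- ping -c 3 10.244.0.5   # pod-to-pod, no NAT
$ ping -c 3 10.244.0.4                         # node-to-pod, run on a minion
$ kubectl exec pod-a -- ip addr show eth0      # the IP the pod sees for itself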

Missing Dependencies

As described in #62, at some version binaries went missing from the CNI folder.

More undocumented dependencies (#64) were found by staring at logs and noting weirdness. The really important ones are conntrack-tools, socat, and bridge-utils; these are now being pinned down upstream.
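On the CentOS 7 image used here, pre-installing them is straightforward (a sketch; these are the package names in the stock repositories):

$ sudo yum install -y conntrack-tools socat bridge-utils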

The errors were time-consuming to understand because their phrasing often left something to be desired. Unfortunately there's at least one known false-positive warning (kubernetes/kubernetes#23385).

Cluster CIDR overlaps

--cluster-cidr="": CIDR Range for Pods in cluster.
--service-cluster-ip-range="": CIDR Range for Services in cluster.

In my case services got a /16 starting at 10.0.0.0 and the cluster-cidr got a /16 at 10.244.0.0. The service CIDR is routable because kube-proxy is constantly writing iptables rules on every minion.

For Weave in particular, --ipalloc-range needs to be passed and must exactly match what is given to the Kubernetes cluster-cidr.

Whatever your network overlay, it must not clobber the service range!
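For reference, a sketch of how the ranges line up in this setup (values taken from the description above; adjust to your own addressing plan):

# kube-apiserver: the service VIP range
--service-cluster-ip-range=10.0.0.0/16
# kube-controller-manager and kube-proxy: the pod range
--cluster-cidr=10.244.0.0/16
# Weave must allocate out of the same pod range
$ weave launch --ipalloc-range 10.244.0.0/16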

Iptables masquerade conflicts

Flannel

If using Flannel, be sure to follow the newly documented instructions: DOCKER_OPTS="--iptables=false --ip-masq=false"

Kube-proxy makes extensive use of masquerading rules. Much like an overlay clobbering the service range, another component (such as the Docker daemon itself) rewriting masquerade rules will cause unexpected behavior.
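A quick way to spot this class of conflict is to look at who owns the masquerade rules in the nat table on a minion (a sketch; 172.17.0.0/16 is Docker's default bridge subnet):

$ iptables -t nat -S POSTROUTING | grep MASQUERADE
# kube-proxy and the overlay should be the only sources of MASQUERADE rules;
# a rule for the Docker bridge (e.g. -s 172.17.0.0/16 ! -o docker0) means the
# daemon is still running with --ip-masq=true.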

Weave

Weave was originally, and erroneously, started with --docker-endpoint=unix:///var/run/weave/weave.sock, which similarly caused unexpected behavior. This flag is extraneous and must be omitted when Weave is used with CNI.

Final Configuration

Image

CentOS 7 (source_ami: ami-bec022de)

Dependencies

SELinux disabled.

Yum installed:

kubernetes_version: 1.4.0-alpha.3 (b44b716965db2d54c8c7dfcdbcb1d54792ab8559)

weave_version: 1.6.1

1 Master (172.20.0.78)

A gist of the journalctl output shows the master boots fine: docker, etcd, kube-apiserver, the scheduler, and the controller-manager all start. The minion registers successfully.

$ kubectl get componentstatuses

NAME                 STATUS    MESSAGE              ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health": "true"}
$ kubectl get nodes 

NAME                                        STATUS    AGE
ip-172-20-0-18.us-west-2.compute.internal   Ready     1m

1 minion (172.20.0.18)

$ kubectl run -i --tty --image concourse/busyboxplus:curl dns-test42-$RANDOM --restart=Never /bin/sh

The pod is created promptly (not sluggishly), and multiple pods can ping each other.
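Since the test image ships curl, another quick check from inside that pod is to hit the kubernetes service repeatedly and watch for stalls (an unauthenticated request will typically get a 401/403, which is fine here; the point is that it answers promptly every time):

$ for i in $(seq 1 20); do curl -ks -o /dev/null -w "%{http_code}\n" --max-time 2 https://kubernetes.default; done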

Weave

Weave and weaveproxy are up and running just fine.

$ weave status

Version: 1.6.0 (version 1.6.1 available - please upgrade!)

        Service: router
       Protocol: weave 1..2
           Name: ce:1a:4b:b0:07:6d(ip-172-20-0-18)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 0
    Connections: 0
          Peers: 1
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.244.0.0/16
  DefaultSubnet: 10.244.0.0/16

        Service: proxy
        Address: unix:///var/run/weave/weave.sock
$ weave status ipam

ce:1a:4b:b0:07:6d(ip-172-20-0-18)        65536 IPs (100.0% of total)

Conclusion

Kubernetes is rapidly evolving with many open issues -- there are now efforts upstream to pin down and document the dependencies, and to make errors and warnings in the logs more user-friendly.

As future versions become less opaque, it will become easier to know which open issue is relevant to your setup, whether an obvious dependency is missing, and what a good setup looks like.

The nominal sanity check command that currently exists (kubectl get componentstatuses) does not go far enough. It might show everything is healthy. Pods might be successfully created. Services might work.

And yet all of these can be misleading, as the cluster may still not be entirely healthy.

A useful test I found in the official repo simply tests connectivity (and authentication) to the master. Sluggishness is not tested, and sluggishness, it turns out, is a red flag.

In fact, there's an entire folder of these, but they are not well documented as far as I can tell.

I believe a smoke test that can be deployed against any running cluster to run through a suite of checks and benchmarks (taking unexpectedly poor performance into account) would significantly improve the debugging experience.
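As a rough illustration of what such a smoke test might check, here is a sketch (not an existing tool; the pod name, image, and thresholds are illustrative only):

#!/bin/bash
# smoke.sh -- crude cluster smoke test: time pod creation, then loop DNS.
name=smoke-$RANDOM

start=$(date +%s)
kubectl run "$name" --image=busybox --restart=Never -- sleep 300
until kubectl get pod "$name" -o jsonpath='{.status.phase}' | grep -q Running; do
  sleep 1
done
echo "pod ready in $(( $(date +%s) - start ))s (anything approaching 30s is a red flag)"

# DNS should answer every single time, not merely most of the time.
fails=0
for i in $(seq 1 50); do
  kubectl exec "$name" -- nslookup kubernetes.default >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "DNS lookup failures: $fails/50"

kubectl delete pod "$name" >/dev/null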

dankohn commented 7 years ago

This is a great start. Could you please do another thorough edit, and then I'll give my comments over the phone Monday or Tuesday on where you're leaving out the story.

dankohn commented 7 years ago

In particular, on your reread, please look at the following things, which are quite distracting:

dankohn commented 7 years ago

Template: I expected x, I got y, based on url, I did z to fix the problem.

leecalcote commented 7 years ago

Possible other template sections:

  1. Expected Behavior
  2. Actual Behavior
  3. Steps to Reproduce
  4. Resolution / Fix
  5. Related Issues