geerlingguy / ansible-for-kubernetes

Ansible and Kubernetes examples from Ansible for Kubernetes Book
https://www.ansibleforkubernetes.com
MIT License

CI test for local bare metal K8s cluster with Docker (in Docker) failing because of Travis CI AUFS problem #5

Closed: geerlingguy closed this issue 4 years ago

geerlingguy commented 4 years ago

Related: https://github.com/geerlingguy/raspberry-pi-dramble/issues/166

Symptom:

TASK [geerlingguy.kubernetes : Configure Flannel networking.] ******************
failed: [kube1] (item=kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/k8s-manifests/kube-flannel-rbac.yml) => {"ansible_loop_var": "item", "changed": false, "cmd": ["kubectl", "apply", "-f", "https://raw.githubusercontent.com/coreos/flannel/master/Documentation/k8s-manifests/kube-flannel-rbac.yml"], "delta": "0:00:01.071768", "end": "2019-12-16 01:18:21.672106", "item": "kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/k8s-manifests/kube-flannel-rbac.yml", "msg": "non-zero return code", "rc": 1, "start": "2019-12-16 01:18:20.600338", "stderr": "unable to recognize \"https://raw.githubusercontent.com/coreos/flannel/master/Documentation/k8s-manifests/kube-flannel-rbac.yml\": Get https://192.168.7.2:6443/api?timeout=32s: dial tcp 192.168.7.2:6443: connect: connection refused

Deeper diagnosis from kubelet's logs:

error creating aufs mount

Note that the related issue linked at the top of this post wasn't a problem until fairly recently. Maybe the Travis CI platform changed a bit?

One thing to consider would be installing a newer version of Docker, since the version shipped in the Python build environment is something like 18.06...? A sketch of one way to do that is below.
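For example, something like this at the top of the playbook (a minimal sketch, assuming a Debian/Ubuntu-based image; Docker's official convenience script just installs the latest stable release over whatever the image ships with):

```yaml
# Minimal sketch: upgrade Docker inside the CI container before anything else.
# Assumes a Debian/Ubuntu base image; get.docker.com is Docker's official
# convenience script and installs the latest stable docker-ce.
pre_tasks:
  - name: Upgrade Docker to the latest stable release.
    shell: curl -fsSL https://get.docker.com | sh
    changed_when: true
```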

geerlingguy commented 4 years ago

Fix for the Pi Dramble issue is in this comment: https://github.com/geerlingguy/raspberry-pi-dramble/issues/166#issuecomment-565895079

geerlingguy commented 4 years ago

That fix doesn't seem to have resolved the entire issue.

geerlingguy commented 4 years ago

The issue seems to be that when the Docker daemon is started or restarted (after being installed inside the container), DNS resolution inside the container goes away.

I checked the /etc/resolv.conf file and in both cases it's the same:

search c.travis-ci-prod-2.internal google.internal
nameserver 127.0.0.11
options attempts:3 ndots:0

But before running the docker role, I get a successful ping:

root@e749afc94541:/# ping www.google.com
PING www.google.com (172.217.0.4) 56(84) bytes of data.
64 bytes from ord38s04-in-f4.1e100.net (172.217.0.4): icmp_seq=1 ttl=54 time=11.8 ms

After running the role, I get:

root@e749afc94541:/# ping www.google.com
ping: www.google.com: Temporary failure in name resolution

I'm going to try to force-mount a read-only resolv.conf in the container with Google DNS to see if that'll fix things.
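Something like this in the docker-compose file, I think (a sketch only; the service name, image, and host path are placeholders for whatever the test harness actually uses — and the 127.0.0.11 nameserver above is, as far as I can tell, Docker's embedded DNS, which seems to stop answering after the inner Docker restart):

```yaml
# Sketch: bind-mount a static resolv.conf (containing "nameserver 8.8.8.8")
# read-only, so nothing inside the container can clobber it. Service name,
# image, and host path are placeholders.
version: '3'
services:
  kube1:
    image: geerlingguy/docker-debian10-ansible:latest
    privileged: true
    volumes:
      - ./tests/resolv.conf:/etc/resolv.conf:ro
```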

geerlingguy commented 4 years ago

Well... that worked. We'll see if this test run succeeds!

geerlingguy commented 4 years ago

Still not completely resolved. Testing inside the Travis CI environment now.

geerlingguy commented 4 years ago

Can't find much there either. Just kubelet failing for unspecified reasons :/

geerlingguy commented 4 years ago

After burning a couple more hours on this, and spending a little time digging through kubeadm, kubelet, and Docker inside a running Travis CI environment, I couldn't find anything pointing to the actual problem.

It seems like kubelet would start, it would connect to the Docker daemon... and then... nothing!

Very puzzling. I've seen kubelet fail in 500 different ways, but there was always some error message that would help. In this case, kubeadm says "control plane never started", kubelet says "connected to docker", and then... nothing. It just starts spewing endless loops of 'can't get v1.Nodes, v1.Pods, v1.Services' forever and ever.

Docker was running fine on all the nodes, and I compared this setup to the very similar one running for raspberry-pi-dramble, and couldn't find any other issue.

The only difference, really, is that this project uses a three-container cluster managed by docker-compose, whereas the Drupal VM setup used a single container and ran the playbook inside it. But even so, the master never initialized all the way, and I could never figure out why.

I solved the AUFS problem. I set the Docker daemon to use the systemd cgroup driver instead of cgroupfs. I forced DNS to use 8.8.8.8 by mounting a resolv.conf file into the containers, and that fixed the DNS breakage after the internal Docker restart... but I couldn't figure out where to go next.
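For reference, the daemon settings that fixed those first two problems, written up as an Ansible task here just for illustration (the real change could just as well be a static file in the repo, and the "restart docker" handler is assumed to exist elsewhere):

```yaml
# Sketch of the Docker daemon settings: systemd cgroup driver (instead of
# cgroupfs) and overlay2 storage (instead of AUFS). The "restart docker"
# handler is assumed to be defined elsewhere in the role/playbook.
- name: Configure Docker daemon options.
  copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "exec-opts": ["native.cgroupdriver=systemd"],
        "storage-driver": "overlay2"
      }
  notify: restart docker
```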

Therefore I give up, and I'll just lint this playbook for now. Testing will go against VirtualBox/Vagrant locally.

geerlingguy commented 4 years ago

As a final note, I was also running into conntrack problems (Docker not allowing kube-proxy to write to the file it uses to raise the conntrack table size) when I was trying the DinD K8s cluster approach with multiple containers locally: https://github.com/kubernetes-retired/kubeadm-dind-cluster/issues/50

So it wasn't fun trying to get Docker-in-Docker with Kubernetes-in-Docker working in a multi-container cluster setup on Travis CI. And though it would get up and running locally, kube-proxy would never start because it couldn't write to that file.
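For posterity: the usual workaround for that one (it's what the DinD-style projects ended up doing) is to tell kube-proxy not to touch the conntrack sysctls at all, via the kubeadm config. A sketch, untested in this setup:

```yaml
# KubeProxyConfiguration snippet appended to the kubeadm config: with
# maxPerCore set to 0, kube-proxy skips writing nf_conntrack_max, which is
# exactly the write that fails inside Docker.
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
conntrack:
  maxPerCore: 0
```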