Fix for the Pi Dramble issue is in this comment: https://github.com/geerlingguy/raspberry-pi-dramble/issues/166#issuecomment-565895079
That fix doesn't seem to have resolved the entire issue.
The issue seems to be that when the Docker daemon is started or restarted (after being installed inside the container), DNS resolution goes away inside the container.
I checked the /etc/resolv.conf file, and in both cases it's the same:
search c.travis-ci-prod-2.internal google.internal
nameserver 127.0.0.11
options attempts:3 ndots:0
But before running the docker role, I get a successful ping:
root@e749afc94541:/# ping www.google.com
PING www.google.com (172.217.0.4) 56(84) bytes of data.
64 bytes from ord38s04-in-f4.1e100.net (172.217.0.4): icmp_seq=1 ttl=54 time=11.8 ms
After running the role, I get:
root@e749afc94541:/# ping www.google.com
ping: www.google.com: Temporary failure in name resolution
I'm going to try to force-mount a read-only resolv.conf in the container with Google DNS to see if that'll fix things.
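Roughly like this (the file path and the base image below are placeholders, not the repo's actual test config):

# Write a resolv.conf that points straight at Google DNS:
echo "nameserver 8.8.8.8" > tests/resolv.conf

# Bind-mount it read-only into a node container, e.g. with plain docker run:
docker run -d --name test-node \
  -v "$(pwd)/tests/resolv.conf:/etc/resolv.conf:ro" \
  debian:buster sleep infinity

In the docker-compose setup the equivalent is a "- ./resolv.conf:/etc/resolv.conf:ro" entry under each service's volumes, so lookups go straight to 8.8.8.8 instead of the embedded 127.0.0.11 resolver.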
Well... that worked. We'll see if this test run succeeds!
Still not completely resolved. Testing inside the Travis CI environment now.
Can't find much there either. Just kubelet failing for unspecified reasons :/
After burning a couple more hours on this, and spending a little time digging through kubeadm, kubelet, and Docker inside a running Travis CI environment, I couldn't find anything pointing to the actual problem.
It seems like kubelet would start, connect to the Docker daemon... and then... nothing!
Very puzzling. I've seen kubelet fail in 500 different ways, but there was always some error message that would help. In this case, kubeadm says "control plane never started", kubelet says "connected to docker", and then... nothing. It just starts spewing endless loops of "can't get v1.Nodes, v1.Pods, v1.Services", forever and ever.
Docker was running fine on all the nodes, and I compared this setup to the very similar one running for raspberry-pi-dramble, and couldn't find any other issue.
The only difference, really, is that this setup used a three-container cluster managed by docker-compose, whereas the Drupal VM one used a single container and ran the playbook inside it. But even so, the master never initialized all the way, and I could never figure out why.
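For context, the multi-container layout looked something like this; the service names, image, and mounts are simplified placeholders rather than the exact file:

# docker-compose.yml (sketch)
version: '3'
services:
  kube1:
    image: geerlingguy/docker-debian10-ansible:latest
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
      - ./resolv.conf:/etc/resolv.conf:ro
  # kube2 and kube3 are defined the same way, and the playbook targets all
  # three over the compose network, initializing the master on kube1.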
I solved the AUFS problem. I set the Docker daemon to use the systemd cgroup driver instead of cgroupfs. I forced DNS to use 8.8.8.8 by mounting a resolv.conf file into the containers, and that fixed the DNS breakage after the internal Docker restart... but I couldn't figure out where to go next.
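For the record, the cgroup driver change boils down to something like this in /etc/docker/daemon.json on each node (in the playbook it would be set through the Docker role's daemon options rather than written by hand, so treat this as a sketch):

{
  "exec-opts": ["native.cgroupdriver=systemd"]
}

kubelet's cgroup driver has to match; kubeadm can detect it from Docker, or it can be set explicitly with cgroupDriver: systemd in the kubelet configuration.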
Therefore I give up, and I'll just lint this playbook for now. Testing will go against VirtualBox/Vagrant locally.
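The lint-only job amounts to something like this (the playbook name here is a placeholder):

pip3 install ansible ansible-lint yamllint
yamllint .
ansible-lint main.yml
ansible-playbook main.yml --syntax-check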
As a final note, I was also running into conntrack problems (Docker not allowing the write to the file that sets the conntrack table size) when I was trying the DinD K8s cluster approach with multiple containers locally: https://github.com/kubernetes-retired/kubeadm-dind-cluster/issues/50
So it wasn't fun getting docker-in-docker with kubernetes-in-docker working in a multi-container cluster setup on Travis CI. And though it would get up and running locally, kube-proxy would never start because it couldn't write to that file.
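For reference, the usual workaround for that conntrack failure is to tell kube-proxy to leave nf_conntrack_max alone, which can be passed through kubeadm as an extra config document. I never got far enough to verify it in this setup, so consider it a sketch:

# Extra document appended to the kubeadm config file (sketch)
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
conntrack:
  maxPerCore: 0   # 0 = don't try to set the host's nf_conntrack_max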
Related: https://github.com/geerlingguy/raspberry-pi-dramble/issues/166
Symptom:
Deeper diagnosis from kubelet's logs:
Note that the related issue linked at the top of this post was not a problem until fairly recently. Maybe the Travis CI platform changed a bit?
One thing to consider would be installing a newer version of Docker, since the version shipped in the Python environment seems to be around 18.06...?
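If anyone wants to test that theory, the quickest way to swap in a current Docker on the Travis image is probably Docker's convenience script (untested here):

# Replace the preinstalled Docker with the latest stable release (sketch, not tested in this CI)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
docker version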