geerlingguy / raspberry-pi-dramble

DEPRECATED - Raspberry Pi Kubernetes cluster that runs HA/HP Drupal 8
http://www.pidramble.com/
MIT License

Travis CI tests failing with 'error creating aufs mount to /var/lib/docker/aufs/mnt' #166

Closed · geerlingguy closed this 4 years ago

geerlingguy commented 4 years ago

Right before the failure, the kubelet log shows:

Nov 01 02:36:11 kube1 kubelet[6800]: I1101 02:36:11.109610    6800 kubelet_node_status.go:72] Attempting to register node kube1
Nov 01 02:36:11 kube1 kubelet[6800]: E1101 02:36:11.110378    6800 kubelet_node_status.go:94] Unable to register node "kube1" with API server: Post https://172.17.0.2:6443/api/v1/nodes: dial tcp 172.17.0.2:6443: connect: connection refused
Nov 01 02:36:11 kube1 kubelet[6800]: E1101 02:36:11.123886    6800 reflector.go:125] 

I see a lot of:

Nov 01 02:36:10 kube1 kubelet[6800]: E1101 02:36:10.960980    6800 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "kube-controller-manager-kube1": Error response from daemon: error creating aufs mount to /var/lib/docker/aufs/mnt/ebd717708637c76c506f11e252593614484456d50f1b25db56f6b65929b51433-init: mount target=/var/lib/docker/aufs/mnt/ebd717708637c76c506f11e252593614484456d50f1b25db56f6b65929b51433-init data=br:/var/lib/docker/aufs/diff/ebd717708637c76c506f11e252593614484456d50f1b25db56f6b65929b51433-init=rw:/var/lib/docker/aufs/diff/63aee2a620d39fbe829fd32262038f4d878f8d162cab4ab139fa06d0d414e77d=ro+wh,dio,xino=/dev/shm/aufs.xino: invalid argument

More debug info on the running docker daemon:

Nov 01 02:36:10 kube1 kubelet[6800]: I1101 02:36:10.160993    6800 docker_service.go:258] Docker Info: &{ID:MTMB:7BZU:GURJ:NZTA:UPOY:CLGM:ODPN:F4AQ:V37L:V2SG:LBF2:7RJN Containers:0 ContainersRunning:0 ContainersPaused:0 ContainersStopped:0 Images:7 Driver:aufs DriverStatus:[[Root Dir /var/lib/docker/aufs] [Backing Filesystem overlayfs] [Dirs 12] [Dirperm1 Supported false]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host ipvlan macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} MemoryLimit:true SwapLimit:true KernelMemory:true KernelMemoryTCP:true CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true PidsLimit:true IPv4Forwarding:true BridgeNfIptables:false BridgeNfIP6tables:false Debug:false NFd:24 OomKillDisable:true NGoroutines:36 SystemTime:2019-11-01T02:36:10.149098016Z LoggingDriver:json-file CgroupDriver:cgroupfs NEventsListener:0 KernelVersion:4.15.0-1028-gcp OperatingSystem:Debian GNU/Linux 10 (buster) (containerized) OSType:linux Architecture:x86_64 IndexServerAddress:https://index.docker.io/v1/ RegistryConfig:0xc000617960 NCPU:2 MemTotal:7836004352 GenericResources:[] DockerRootDir:/var/lib/docker HTTPProxy: HTTPSProxy: NoProxy: Name:kube1 Labels:[] ExperimentalBuild:false ServerVersion:19.03.1 ClusterStore: ClusterAdvertise: Runtimes:map[runc:{Path:runc Args:[]}] DefaultRuntime:runc Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster:<nil> Warnings:[]} LiveRestoreEnabled:false Isolation: InitBinary:docker-init ContainerdCommit:{ID:b34a5c8af56e510852c35414db4c1f4fa6172339 Expected:b34a5c8af56e510852c35414db4c1f4fa6172339} RuncCommit:{ID:3e425f80a8c931f88e6d94a8c831b9d5aa481657 Expected:3e425f80a8c931f88e6d94a8c831b9d5aa481657} InitCommit:{ID:fec3683 Expected:fec3683} SecurityOptions:[name=seccomp,profile=default] ProductLicense: Warnings:[WARNING: bridge-nf-call-iptables is disabled WARNING: bridge-nf-call-ip6tables is disabled WARNING: the aufs storage-driver is deprecated, and will be removed in a future release.]}
geerlingguy commented 4 years ago

One solution might be setting the storage driver to vfs... if, indeed, that's the problem. It might not be. For now I'm just checking on things by running docker info both in Travis and in the started test container.

Also, see: https://docs.docker.com/storage/storagedriver/select-storage-driver/
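
A minimal sketch of that vfs experiment, assuming a host where the daemon reads /etc/docker/daemon.json (the docker info --format flag needs Docker 1.13+):

# Point the daemon at vfs, restart it, and confirm which driver it picked up.
echo '{"storage-driver": "vfs"}' | sudo tee /etc/docker/daemon.json
sudo service docker restart
docker info --format '{{.Driver}}'  # should print: vfs

(vfs copies every layer in full and is slow, so this is a diagnostic, not a long-term fix.)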

Finu commented 4 years ago

Hi - I've recently run into this same error when trying to run molecule test on Travis (Docker in Docker). Did you manage to overcome this issue somehow?

jobcespedes commented 4 years ago

Same here testing in Travis.

Travis VM storage driver: overlay2
Molecule Docker storage driver: aufs

I'm wondering why the Molecule-managed Docker ends up with aufs. I'm using @geerlingguy's Docker images for testing in Travis and his roles for the Docker installation.

geerlingguy commented 4 years ago

Over in https://github.com/geerlingguy/ansible-for-kubernetes/issues/5, I posited it might help to upgrade Docker CE inside the Travis CI environment first... attempting that now in https://github.com/geerlingguy/raspberry-pi-dramble/commit/ca3f2964d8dc99a8d5f7011b688c7fddc54e2987
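
For reference, one way to do that upgrade in a Travis before_install step is Docker's convenience script (the commit above may take a different route):

# Install or upgrade to the latest Docker CE using the official script.
curl -fsSL https://get.docker.com -o /tmp/get-docker.sh
sudo sh /tmp/get-docker.sh
docker --version  # Confirm the upgraded version.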

geerlingguy commented 4 years ago

Interesting: when I run the build and have the kubeadm init command's output returned via -vvvv, I see the following stderr output:

  [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
  [WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
  [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 19.03.1. Latest validated version: 18.09
  [WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: ERROR: ../libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.15.0-1028-gcp/modules.dep.bin'
modprobe: FATAL: Module configs not found in directory /lib/modules/4.15.0-1028-gcp
", err: exit status 1
"error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster"

geerlingguy commented 4 years ago

Also, the stdout has some interesting info that led me to print the output of journalctl -u kubelet:

[init] Using Kubernetes version: v1.15.7
[preflight] Running pre-flight checks
[preflight] The system verification failed. Printing the output from the verification:
KERNEL_VERSION: 4.15.0-1028-gcp
DOCKER_VERSION: 19.03.1
OS: Linux
CGROUPS_CPU: enabled
CGROUPS_CPUACCT: enabled
CGROUPS_CPUSET: enabled
CGROUPS_DEVICES: enabled
CGROUPS_FREEZER: enabled
CGROUPS_MEMORY: enabled
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kube1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.17.0.2]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kube1 localhost] and IPs [172.17.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kube1 localhost] and IPs [172.17.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

Unfortunately, an error has occurred:
    timed out waiting for the condition

This error is likely caused by:
    - The kubelet is not running
    - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
    - 'systemctl status kubelet'
    - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
    - 'docker ps -a | grep kube | grep -v pause'
    Once you have found the failing container, you can inspect its logs with:
    - 'docker logs CONTAINERID'
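
For anyone reproducing this, a simple way to dump that kubelet journal in a CI step (assuming systemd and journald are running inside the test container):

# Print the last 200 lines of the kubelet unit's journal, without a pager.
journalctl -u kubelet --no-pager -n 200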
geerlingguy commented 4 years ago

Upgrading Docker CE didn't work. So now I'm trying to configure overlay2 as the default driver instead... we'll see if that makes a difference.

jobcespedes commented 4 years ago

In my case, here's what worked and what didn't:

| VM storage driver | Nested container storage driver | Worked? |
| --- | --- | --- |
| overlay2 | aufs | no* |
| overlay2 | overlay2 | no |
| aufs | aufs | yes |

*The first combination is the default config.
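
If the nested Docker is installed with @geerlingguy's role, one possible way to pin its storage driver explicitly is the role's docker_daemon_options variable (a sketch only; check the role's defaults for the exact variable name):

# Hedged sketch for an Ansible role such as geerlingguy.docker, which can
# template these options into /etc/docker/daemon.json:
docker_daemon_options:
  storage-driver: "aufs"  # or "overlay2", per the table above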

geerlingguy commented 4 years ago

So... with both the VM and the nested container on overlay2, it just worked. To get this working in Travis CI, here's what I did:

1. Update the Docker configuration file in Travis CI and restart Docker:
# If on Travis CI, update Docker's configuration.
if [ "$TRAVIS" == "true" ]; then
  mkdir -p /tmp/docker  # Backs the container's /var/lib/docker bind mount (step 2).
  echo '{
    "experimental": true,
    "storage-driver": "overlay2"
  }' | sudo tee /etc/docker/daemon.json
  sudo service docker restart
fi
2. In the docker run command, mount the Docker daemon config into the container, and add a bind mount to /var/lib/docker so overlay2 doesn't error out:
docker run [...] \
  --volume=/etc/docker/daemon.json:/etc/docker/daemon.json:ro \
  --mount type=bind,src=/tmp/docker,dst=/var/lib/docker
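
To confirm the nested daemon actually picked up overlay2, a quick check from the host (kube1 here is just the test container's name from the logs above):

# Ask the Docker daemon inside the test container for its storage driver.
docker exec kube1 docker info --format '{{.Driver}}'  # should print: overlay2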
geerlingguy commented 4 years ago

First successful build: https://travis-ci.org/geerlingguy/raspberry-pi-dramble/builds/625489080

Woohoo, now I can once again stop getting those weekly 'your tests are still failing' emails :)

geerlingguy commented 3 years ago

Just wanted to note that my awx build on GitHub Actions is now giving a similar error:

ERROR: for awx_redis  Cannot create container for service redis: error creating aufs mount

Over in https://github.com/moby/moby/issues/13742, I saw a comment mentioning:

Starting the daemon with storage-driver: vfs (/usr/bin/dockerd --storage-driver=vfs) solved the problem.

So a fix for a GH Actions workflow is adding a step like:

      - name: Force GitHub Actions' docker daemon to use vfs.
        run: |
          sudo systemctl stop docker
          echo '{"cgroup-parent":"/actions_job","storage-driver":"vfs"}' | sudo tee /etc/docker/daemon.json
          sudo systemctl start docker
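
If you want to be sure the daemon came back with the new driver, a hypothetical follow-up step:

      - name: Verify the storage driver is now vfs.
        run: docker info --format '{{.Driver}}'

(The cgroup-parent value above presumably just preserves what the hosted runner already sets in its daemon.json, so only the storage driver changes.)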