ghost opened this issue 6 years ago
@ddebroy some additional information - We are exhausting Inodes as shown below.
swarm-worker000003:~$ df -i
Filesystem Inodes Used Available Use% Mounted on
overlay 1966080 1960215 5865 100% /
tmpfs 1792091 186 1791905 0% /dev
tmpfs 1792091 15 1792076 0% /sys/fs/cgroup
tmpfs 1792091 1884 1790207 0% /etc
/dev/sda1 1966080 1960215 5865 100% /home
tmpfs 1792091 1884 1790207 0% /mnt
shm 1792091 1 1792090 0% /dev/shm
tmpfs 1792091 1884 1790207 0% /lib/firmware
/dev/sda1 1966080 1960215 5865 100% /var/log
/dev/sda1 1966080 1960215 5865 100% /etc/ssh
tmpfs 1792091 1884 1790207 0% /lib/modules
/dev/sda1 1966080 1960215 5865 100% /etc/hosts
/dev/sda1 1966080 1960215 5865 100% /var/etc/hostname
/dev/sda1 1966080 1960215 5865 100% /etc/resolv.conf
/dev/sda1 1966080 1960215 5865 100% /var/etc/docker
tmpfs 1792091 376 1791715 0% /var/run/docker.sock
/dev/sda1 1966080 1960215 5865 100% /var/lib/waagent
tmpfs 1792091 1884 1790207 0% /usr/local/bin/docker
/dev/sdb1 256 27 229 11% /mnt/resource
Based on https://github.com/moby/moby/issues/10613 we ran docker rmi $(docker images -q --filter "dangling=true"), which took inode usage down to 21%.
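For anyone hitting this, a minimal sketch of an inode check that could be run before (and after) that cleanup. The 90% threshold and the choice of / as the filesystem to watch are illustrative assumptions, not part of the original report:

```shell
#!/bin/sh
# Sketch: alert when inode usage on a filesystem crosses a threshold.
# THRESHOLD and the "/" mount point are illustrative values.
THRESHOLD=90
# Column 5 of `df -Pi` is the inode Use%; strip the "%" for arithmetic.
usage=$(df -Pi / | awk 'NR==2 {gsub(/%/, ""); print $5}')
if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "inode usage is ${usage}%: consider removing dangling images"
fi
```

A check like this could run from cron on each worker, since UCP's disk gauge alone did not surface the problem.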
Seems like something is off with the VHD used by the template I pointed to earlier: it is not mounting /dev/sdb
correctly. Will update with more findings.
Thanks @ddebroy
Update: It turns out the template I referred to earlier, https://download.docker.com/azure/17.06/17.06.2/Docker-DDC.tmpl, points to VHD 1.0.9, which does not incorporate the enhancement that mounts /var/lib/docker on the second, larger sdb disk provisioned by Azure. That enhancement is being rolled out first in the CE version and then, depending on how things go, will be rolled out as part of the next EE release: 17.06.2-ee4.
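A quick way to confirm whether a given node picked up that enhancement is to check which block device backs /var/lib/docker. A sketch; the expectation of /dev/sdb1 as the data disk is specific to the Azure setup described above:

```shell
#!/bin/sh
# Sketch: report which device /var/lib/docker lives on.
# On a node with the enhancement this should be the large Azure data
# disk (/dev/sdb1); on VHD 1.0.9 it is still on the small OS disk.
df -P /var/lib/docker | awk 'NR==2 {print $1, "->", $6}'
```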
Re-reading the above, it sounds like the df -h output was in sync with what UCP was reporting, but the real problem was inode exhaustion, which docker rmi took care of, correct?
Yes @ddebroy, but we ended up in a bad state between managers and workers - similar to what is described here: https://github.com/docker/swarm/issues/2044
Although we could pull images after the docker rmi, tasks wouldn't advance past the 'assigned' state, and the workers were logging the following:
Not enough managers yet. We only have 0 and we need 3 to continue.
sleep for a bit, and try again when we wake up.
We tried provisioning a new worker (which failed to connect) and restarting the UCP agent and controller, to no avail.
At this point, we deleted the cluster again and may wait for 17.06.2-ee4. Is there an expected release date?
Hmm... I am not sure of the steps you took, but a worker will never log the message:
Not enough managers yet. We only have 0 and we need 3 to continue.
sleep for a bit, and try again when we wake up.
That is something a new manager logs when it is unable to join the swarm. It sounds like you were trying to bring up new manager nodes? Looking through the diagnostic logs from your initial message, the swarm appears to be in a stable state. I guess the swarm cluster ended up in a bad state once the inode issue appeared.
By any chance, can you share steps to reproduce step (2) above ("Deploy a number of services (accumulated worker images are about 14GB)") as closely as possible to what you tried, so that we can reproduce your environment internally and investigate?
Regarding 17.06.2-ee-4: we are running into some delays getting the VHDs that work the way we want with 17.06.2-ee-4 published through Azure. We will update once that is done and we are ready.
Sure - it is probably a side effect of Node.js applications, where thousands of tiny files make up the application. I'll see if I can locate a suitable example; otherwise I'll publish a sample for you that triggers the issue.
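To illustrate why this class of application hits the limit, here is a small sketch (the file count and paths are made up): each file consumes one inode regardless of its size, so thousands of tiny files burn through inodes while barely registering in df -h:

```shell
#!/bin/sh
# Sketch: simulate a node_modules-style tree of tiny files.
# 1000 files consume ~1000 inodes but only a few KB of actual data,
# so `df -h` stays flat while `df -i` climbs.
dir=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
    echo x > "$dir/file$i"
    i=$((i + 1))
done
echo "files created: $(find "$dir" -type f | wc -l)"
rm -rf "$dir"
```

A container image built from a tree like this multiplies the effect across every pulled layer, which matches the ~14GB of accumulated worker images exhausting ~2M inodes above.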
@ddebroy - @jeffnessen mentioned he has a suitable test container for you.
Expected behavior
UCP should have accurate indication of worker disk usage
Actual behavior
Worker disk appears full despite UCP reporting available space
Information
Steps to reproduce the behavior
1. Spin up a docker cluster using the beta template from https://github.com/docker/for-azure/issues/38 (worker instances are D3_V2)
2. Deploy a number of services (accumulated worker images are about 14GB)
3. Service deployments begin to fail with "No such image: <image-name>"
4. Verify the image exists in DTR and is pullable
5. Log on to a worker and attempt to pull the image (~200MB image)
swarm-worker000003:~$ df -h
Filesystem Size Used Available Use% Mounted on
overlay 29.4G 17.5G 10.4G 63% /
tmpfs 6.8G 4.0K 6.8G 0% /dev
tmpfs 6.8G 0 6.8G 0% /sys/fs/cgroup
tmpfs 6.8G 161.4M 6.7G 2% /etc
/dev/sda1 29.4G 17.5G 10.4G 63% /home
tmpfs 6.8G 161.4M 6.7G 2% /mnt
shm 6.8G 0 6.8G 0% /dev/shm
tmpfs 6.8G 161.4M 6.7G 2% /lib/firmware
/dev/sda1 29.4G 17.5G 10.4G 63% /var/log
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/ssh
tmpfs 6.8G 161.4M 6.7G 2% /lib/modules
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/hosts
/dev/sda1 29.4G 17.5G 10.4G 63% /var/etc/hostname
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/resolv.conf
/dev/sda1 29.4G 17.5G 10.4G 63% /var/etc/docker
tmpfs 1.4G 1.3M 1.4G 0% /var/run/docker.sock
/dev/sda1 29.4G 17.5G 10.4G 63% /var/lib/waagent
tmpfs 6.8G 161.4M 6.7G 2% /usr/local/bin/docker
/dev/sdb1 200.0G 119.0M 199.9G 0% /mnt/resource
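The gap in this df -h output (63% of bytes used on / while the df -i output earlier in the thread shows 100% of inodes used) is exactly what byte-based monitoring misses. A sketch of a side-by-side check; the single / mount point is illustrative, and a real check would loop over all mounts:

```shell
#!/bin/sh
# Sketch: report byte usage and inode usage side by side, since a
# filesystem can be full on inodes while df -h still shows free space.
for fs in /; do
    bytes=$(df -P "$fs" | awk 'NR==2 {gsub(/%/, ""); print $5}')
    inodes=$(df -Pi "$fs" | awk 'NR==2 {gsub(/%/, ""); print $5}')
    echo "$fs: ${bytes}% of bytes used, ${inodes}% of inodes used"
done
```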