docker-archive / for-azure

UCP Not showing accurate disk usage #41

ghost opened this issue 6 years ago (status: Open)

ghost commented 6 years ago

Expected behavior

UCP should give an accurate indication of worker disk usage

Actual behavior

Worker disk appears full despite UCP reporting available space

Information

Steps to reproduce the behavior

  1. Spin up a Docker cluster using the beta template from https://github.com/docker/for-azure/issues/38 (worker instances are D3_V2)

  2. Deploy a number of services (accumulated worker images are about 14GB)

  3. Service deployments begin to fail with "No such image: <image-name>"

  4. Verify the image exists in DTR and is pullable

  5. Log on to the worker and attempt to pull the image (~200MB):

    
    swarm-worker000003:~$ docker pull <image-name>
    <tag>: Pulling from <repo>
    6d987f6f4279: Already exists 
    d0e8a23136b3: Already exists 
    5ad5b12a980e: Already exists 
    275352573fee: Pull complete 
    ffbeb13b7578: Pull complete 
    027bb24d721d: Pull complete 
    aa04d7355dfa: Extracting [==================================================>]  45.51MB/45.51MB
    failed to register layer: Error processing tar file(exit status 1): mkdir /app/node_modules/@types/lodash/gt: no space left on device
  6. Check disk space from the worker:

    swarm-worker000003:~$ df -h
    Filesystem                Size      Used Available Use% Mounted on
    overlay                  29.4G     17.5G     10.4G  63% /
    tmpfs                     6.8G      4.0K      6.8G   0% /dev
    tmpfs                     6.8G         0      6.8G   0% /sys/fs/cgroup
    tmpfs                     6.8G    161.4M      6.7G   2% /etc
    /dev/sda1                29.4G     17.5G     10.4G  63% /home
    tmpfs                     6.8G    161.4M      6.7G   2% /mnt
    shm                       6.8G         0      6.8G   0% /dev/shm
    tmpfs                     6.8G    161.4M      6.7G   2% /lib/firmware
    /dev/sda1                29.4G     17.5G     10.4G  63% /var/log
    /dev/sda1                29.4G     17.5G     10.4G  63% /etc/ssh
    tmpfs                     6.8G    161.4M      6.7G   2% /lib/modules
    /dev/sda1                29.4G     17.5G     10.4G  63% /etc/hosts
    /dev/sda1                29.4G     17.5G     10.4G  63% /var/etc/hostname
    /dev/sda1                29.4G     17.5G     10.4G  63% /etc/resolv.conf
    /dev/sda1                29.4G     17.5G     10.4G  63% /var/etc/docker
    tmpfs                     1.4G      1.3M      1.4G   0% /var/run/docker.sock
    /dev/sda1                29.4G     17.5G     10.4G  63% /var/lib/waagent
    tmpfs                     6.8G    161.4M      6.7G   2% /usr/local/bin/docker
    /dev/sdb1               200.0G    119.0M    199.9G   0% /mnt/resource

  7. Check the UCP dashboard:
![image](https://user-images.githubusercontent.com/2473742/31799287-0471e4dc-b4ee-11e7-9ba6-765e83703391.png)

The fact that the disk is full at all with only 14GB of data seems likely related to #19 and #29. But unlike when we experienced #38, there was no indication from the dashboard (or even from the worker instance itself) that some underlying storage resource was full (see the `df` output above).
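
For reference, a quick way to cross-check what the kernel and the Docker daemon each think is in use on a worker is something like the following (a minimal sketch; `docker system df` assumes Docker 1.13 or later):

    # Block-level usage of the filesystem backing the Docker data root
    df -h /var/lib/docker

    # Inode usage on the same filesystem; images made of many tiny files
    # can exhaust inodes long before the byte capacity is used up
    df -i /var/lib/docker

    # The daemon's own accounting of images, containers and local volumes
    docker system df
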
spanditcaa commented 6 years ago

@ddebroy, some additional information: we are exhausting inodes, as shown below.

swarm-worker000003:~$ df -i
Filesystem              Inodes      Used Available Use% Mounted on
overlay                1966080   1960215      5865 100% /
tmpfs                  1792091       186   1791905   0% /dev
tmpfs                  1792091        15   1792076   0% /sys/fs/cgroup
tmpfs                  1792091      1884   1790207   0% /etc
/dev/sda1              1966080   1960215      5865 100% /home
tmpfs                  1792091      1884   1790207   0% /mnt
shm                    1792091         1   1792090   0% /dev/shm
tmpfs                  1792091      1884   1790207   0% /lib/firmware
/dev/sda1              1966080   1960215      5865 100% /var/log
/dev/sda1              1966080   1960215      5865 100% /etc/ssh
tmpfs                  1792091      1884   1790207   0% /lib/modules
/dev/sda1              1966080   1960215      5865 100% /etc/hosts
/dev/sda1              1966080   1960215      5865 100% /var/etc/hostname
/dev/sda1              1966080   1960215      5865 100% /etc/resolv.conf
/dev/sda1              1966080   1960215      5865 100% /var/etc/docker
tmpfs                  1792091       376   1791715   0% /var/run/docker.sock
/dev/sda1              1966080   1960215      5865 100% /var/lib/waagent
tmpfs                  1792091      1884   1790207   0% /usr/local/bin/docker
/dev/sdb1                  256        27       229  11% /mnt/resource

Based on https://github.com/moby/moby/issues/10613, we ran `docker rmi $(docker images -q --filter "dangling=true")`, which took inode usage down to 21%.
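
For anyone else hitting this, the same cleanup on Docker 1.13 and later can also be expressed with the prune subcommands (a sketch of the standard CLI, not something we ran here):

    # Remove dangling images only (same effect as the rmi command above)
    docker image prune

    # More aggressive: also remove images not referenced by any container
    docker image prune -a

    # Stopped containers, unused networks and dangling images in one go
    docker system prune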

ddebroy commented 6 years ago

Seems like something is off with the VHD used by the template I pointed to earlier: it is not mounting /dev/sdb correctly. Will update with more findings.

spanditcaa commented 6 years ago

Thanks @ddebroy

ddebroy commented 6 years ago

Update: it turns out the template I referred to earlier, https://download.docker.com/azure/17.06/17.06.2/Docker-DDC.tmpl, points to VHD 1.0.9, which did not incorporate the enhancement that uses the second, larger sdb disk provisioned by Azure to mount /var/lib/docker. That enhancement is first being rolled out in the CE version and then, depending on how that goes, will be rolled out as part of the next EE release: 17.06.2-ee4.
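
To illustrate what that enhancement amounts to: the large sdb resource disk ends up backing /var/lib/docker, so image layers no longer compete with the ~30GB OS disk. On a plain Linux host the manual equivalent would look roughly like the sketch below (a hypothetical illustration only; the Docker for Azure VHD uses its own init scripts and layout, so do not run this as-is on the cluster nodes):

    # Hypothetical sketch for a generic Linux host, not the Azure VHD itself
    # Stop the engine before touching its data root
    service docker stop

    # /dev/sdb1 is already formatted and mounted at /mnt/resource (see the
    # df output above); move the data root there and bind-mount it back
    cp -a /var/lib/docker /mnt/resource/docker
    mount --bind /mnt/resource/docker /var/lib/docker

    service docker start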

Re-reading the above, it sounds like the `df -h` output was in sync with what UCP was reporting, but the real problem was inode exhaustion, which `docker rmi` took care of. Correct?

spanditcaa commented 6 years ago

Yes @ddebroy, but we ended up in a bad state between managers and workers - similar to what is described here: https://github.com/docker/swarm/issues/2044

Although we could pull images after the `docker rmi`, the tasks wouldn't advance past the 'assigned' state, and the workers were logging the following:

Not enough managers yet. We only have 0 and we need 3 to continue.
sleep for a bit, and try again when we wake up.

We tried provisioning a new worker (which failed to connect) and restarting the UCP agent and controller, to no avail.
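
For anyone debugging a similar state, the basic swarm-level checks would be along these lines (a generic sketch of standard commands, nothing UCP-specific):

    # On a manager: list nodes with their availability and manager status
    docker node ls

    # On the affected worker: confirm swarm membership and which manager
    # addresses the node knows about (shown in the Swarm section)
    docker info

    # For a stuck service: show task history with untruncated error messages
    docker service ps --no-trunc <service-name>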

At this point, we deleted the cluster again and may wait for 17.06.2-ee4. Is there an expected release date?

ddebroy commented 6 years ago

Hmm.. I am not sure of the steps you took, but a worker will never log the message:

Not enough managers yet. We only have 0 and we need 3 to continue.
sleep for a bit, and try again when we wake up.

That is something a new manager logs when it is unable to join the swarm. It sounds like you were trying to bring up new manager nodes? Looking through the diagnostic logs from your initial message, the swarm appears to be in a stable state. I guess the swarm cluster ended up in a bad state once the inode issue appeared.
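
To make the distinction concrete, this is the generic shape of how nodes join a swarm in either role (a sketch of the standard CLI, independent of whatever bootstrap scripting the Azure template does):

    # Run on an existing manager: print the join command for each role
    docker swarm join-token worker
    docker swarm join-token manager

    # Run on the new node: the token decides whether it joins as a worker
    # or as a manager that has to reach the existing managers' quorum
    docker swarm join --token <worker-or-manager-token> <manager-ip>:2377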

By any chance, is there a way you can share steps to reproduce step (2) above ("Deploy a number of services; accumulated worker images are about 14GB") as close as possible to what you actually tried, so that we can reproduce your environment internally and investigate?

Regarding 17.06.2-ee-4: we are running into some delays getting the VHDs (that work the way we want with 17.06.2-ee-4) published through Azure. Will update once that is done and we are ready.

spanditcaa commented 6 years ago

Sure, it is probably a side effect of Node.js applications, where we have thousands of tiny files that make up the application. I'll see if I can locate a suitable example; otherwise I'll publish a sample for you that triggers the issue.
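
In the meantime, here is a rough sketch of the kind of synthetic image that reproduces the shape of the problem: a build that writes a very large number of tiny files, similar to a big node_modules tree. The base image, file counts, and registry names below are placeholders, not our actual application. A Dockerfile along these lines:

    # Synthetic: one layer containing a couple hundred thousand tiny files,
    # roughly the inode profile of a large node_modules tree
    FROM alpine:3.6
    RUN mkdir -p /app/node_modules && \
        for d in $(seq 1 1000); do \
          mkdir -p /app/node_modules/pkg$d && \
          for f in $(seq 1 200); do echo x > /app/node_modules/pkg$d/file$f; done; \
        done

built, pushed to a registry the workers can reach (DTR in our case), and deployed so that every worker pulls the layer and consumes inodes under its own /var/lib/docker:

    docker build -t inode-hog .
    docker tag inode-hog <dtr-host>/<namespace>/inode-hog
    docker push <dtr-host>/<namespace>/inode-hog
    docker service create --name inode-hog --replicas 3 \
        <dtr-host>/<namespace>/inode-hog sleep 86400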

spanditcaa commented 6 years ago

@ddebroy - @jeffnessen mentioned he has a suitable test container for you.