docker-archive / for-aws

Docker for AWS can't download images from Docker Hub any more #47

Open goloroden opened 7 years ago

goloroden commented 7 years ago

I have been running Docker for AWS 1.13 for a few months now, and after some initial problems it has worked perfectly for quite some time. That is, until… ;-)

A few days ago I ran into some strange behavior: I stopped a container and tried to start a new version of it. The image for the container is in a private repository on Docker Hub. However, I was not able to start it as a service, because Docker now tells me that it can't find the image - supposedly, the image doesn't exist.

But: The image actually exists, and I can pull it without any problems e.g. from my local machine.

I have now figured out that this applies to all running containers: I cannot start a new version of any of them.

Any idea what's causing this behavior, and how to fix it?

kencochrane commented 7 years ago

@goloroden can you give us the names of the images you are having issues with, and we will look to see what is causing the problem?

goloroden commented 7 years ago

Yes, of course: thenativeweb/enterjs2017, starting from version 0.2.43.

It doesn't matter whether I try to pull a specific version or just latest - neither can be found.

kencochrane commented 7 years ago

@goloroden did you happen to change your password in hub recently?

goloroden commented 7 years ago

Unfortunately, no :-(

kencochrane commented 7 years ago

@goloroden weird, one thing to try:

  1. login to a manager node
  2. try to docker pull the image that you were having issues with.

Did it work? If not, do a docker login and log in with creds that are able to pull the images. Then try the pull again - did it work that time?

I'm trying to see if the issue is with docker engine, or swarm. Hopefully this will narrow it down a little.
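
Roughly, using the image name from above (just a sketch - substitute whatever tag you actually need):

$ docker pull thenativeweb/enterjs2017:latest

and, if that fails:

$ docker login
$ docker pull thenativeweb/enterjs2017:latest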

friism commented 7 years ago

I think you may have to docker login and then update the service with --with-registry-auth. See this issue: https://github.com/moby/moby/issues/24940
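
Roughly (a sketch - <service-name> is a placeholder for your actual service):

$ docker login
$ docker service update --with-registry-auth <service-name>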

goloroden commented 7 years ago

I have tried to docker login again and to redeploy using the --with-registry-auth flag. Same effect as before: the image could not be found. As mentioned, if I then try to run

$ docker pull thenativeweb/enterjs2017:0.2.44

locally, everything works fine.

If I login to one of the managers, and run the same command, it fails:

$ docker pull thenativeweb/enterjs2017:0.2.44
Error response from daemon: repository thenativeweb/enterjs2017 not found: does not exist or no pull access

If I now run docker login, log in using the same credentials as on my local machine, and re-run it, it pulls the image as expected without any errors.

kencochrane commented 7 years ago

Ok, now that you can pull from that manager, run the service update that @friism mentioned above from the same manager, and see if that helps.

goloroden commented 7 years ago

I have tried that, and it does not help. I can pull the image without problems, but I cannot create a service using the image - the image cannot be found.

kencochrane commented 7 years ago

@goloroden ok, thanks. I wonder if there is a bug in swarm where it lost the credentials for some reason.

one last thing to try: if you create a new service that needs to pull private images, does that new service work?

goloroden commented 7 years ago

It doesn't work with any other private images either.

But: I then tried to create a service using a public image (in this case, nginx), and this didn't work either. The error message then is:

No such image: nginx@sha256:4…

This looks quite strange, because of the @sha256: part… Is this normal?

kencochrane commented 7 years ago

@goloroden that is weird - it doesn't work for any images, not even public ones, so that means we need to look for a different cause.

Since you can pull directly from the same host using docker pull, that rules out a network issue.

If you see it on all of the hosts in your swarm (managers and workers), then it isn't anything related to one particular host.

Do you see anything from the docker logs in /var/log/docker/?

Can you run docker-diagnose so we can see what is going on in the cluster? More details on how to run docker-diagnose are here: https://docs.docker.com/docker-for-aws/faqs/#where-do-i-report-problems-or-bugs
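
In short (assuming the standard Docker for AWS SSH setup, where you SSH in as the docker user):

$ ssh -i <path-to-your-key.pem> docker@<manager-public-ip>
$ docker-diagnose

It prints a diagnose ID that you can paste here.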

goloroden commented 7 years ago

I have checked the log, and there was nothing that caught my eye, but I'm not an expert in these logs.

I've run docker-diagnose, and the ID is 1496731860-WCt2TViUzAEn68tqYA2x5AzIow4i8rEI

kencochrane commented 7 years ago

@goloroden thanks, @nathanleclaire can you please take a look at the diagnose output, and see if you see anything that might be causing the issue?

goloroden commented 7 years ago

@kencochrane @nathanleclaire As always, thanks for your great support 😊

nathanleclaire commented 7 years ago

I don't see anything too interesting / obvious in the debug logs.

This issue reminds me a ton of https://github.com/moby/moby/issues/8376, right down to the fact that public images which should work perfectly fine stop working.

goloroden commented 7 years ago

Yes, this is true - but as mentioned, I haven't changed my password recently.

Is there anything else I can do so that we can get closer to what's causing these issues?

nathanleclaire commented 7 years ago

Yes, I'd be surprised to find it's exactly the same issue - that's quite an old bug -- but I wouldn't be surprised to find out it's another bug related to distribution of registry credentials in the swarm somehow.

nathanleclaire commented 7 years ago

@goloroden Could I get you to attempt the service create (and maybe the pull) which is failing, then run docker-diagnose immediately after? There are some repeated messages in the logs which may have pushed out useful info.

goloroden commented 7 years ago

Yes, of course! I'm thankful for any help 😊

I've done what you asked for. The new diagnose ID is 1496773937-kPJBEHkM42bamNbgOGP3u7z4DlSPXAEj.

nathanleclaire commented 7 years ago

thanks, i'll take a look

nathanleclaire commented 7 years ago

ah, i think i see a likely suspect

nathanleclaire commented 7 years ago

./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout:Jun  6 18:32:16 moby root: time="2017-06-06T18:32:16.378805119Z" level=error msg="Not continuing with pull after error: failed to register layer: Error processing tar file(exit status 1): open /app/node_modules/babel-helper-define-map/.npmignore: no space left on device"

might be out of disk space on most of the nodes?

nathanleclaire commented 7 years ago

hm, disk usage doesn't look too egregious though, so i'm a bit confused

/dev/xvdb1               19.7G      1.3G     17.4G   7% /var/lib/docker/overlay2
/dev/xvdb1               19.7G    958.8M     17.7G   5% /var/lib/docker/overlay2
/dev/xvdb1               19.7G    811.4M     17.9G   4% /var/lib/docker/overlay2
/dev/xvdb1               19.7G     11.1G      7.5G  60% /var/lib/docker/overlay2
/dev/xvdb1               19.7G     11.2G      7.5G  60% /var/lib/docker/overlay2

goloroden commented 7 years ago

Okay … this would explain a lot, but that would also mean that the setting Automatically clean up services (or similar … the one that regularly runs docker system prune) in the CloudFormation template does not work (or at least does not do what I expect it to do).

Unless there is another reason why the disk got filled up entirely … but I will check this out.

Is it safe to just kill the machine using the AWS console? Will the Swarm cluster survive this, and start up a new one? In other words: What is the simplest way to make Swarm kill one machine and replace it with a new instance?

nathanleclaire commented 7 years ago

I also see the message you reference above:

6-06T18:32:06.921229190Z" level=error msg="fatal task error" error="No such image: thenativeweb/enterjs2017@sha256:61790b4698f0b96c0df2135d3ab7e8b184f2926b1b20710782144d4a786adb23" module="node/agent/taskmanager" task.id=15i31rs8tjge1pllv0o6z7q50

nathanleclaire commented 7 years ago

seems it's version 17.03.0-ce, correct?

goloroden commented 7 years ago

Yes, 17.03.0-ce-aws1 to be exact.

nathanleclaire commented 7 years ago

@stevvooe @aaronlehmann Do you have any idea why swarmkit might attempt to pull by SHA, and be told that such a SHA doesn't exist, if the user is only specifying a tag?

aaronlehmann commented 7 years ago

@nathanleclaire: It pulls by digest so that the same version will be pulled on each node.
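
For example, you can see which digest a service was resolved to with something like (the service name is a placeholder):

$ docker service inspect --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' <service-name>

which prints the image reference in the form thenativeweb/enterjs2017:0.2.44@sha256:…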

The daemon log will show what it tried to pull and why it didn't succeed. I'd recommend taking a look at that.

Somewhat related: https://github.com/moby/moby/issues/33521

goloroden commented 7 years ago

Where do I find those daemon logs?

nathanleclaire commented 7 years ago

Thanks @aaronlehmann, I'll see if I can dig up anything interesting in the surrounding logs.

BTW, I also see messages like this - any idea what that might be about?

./ip-172-31-16-224-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:12 moby root: time="2017-06-06T18:32:12.133991105Z" level=warning msg="sending message to an unrecognized member ID 27e47f54051f679c" raft_id=4392a9b88600aa2b
./ip-172-31-16-224-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout:Jun  6 18:32:12 moby root: time="2017-06-06T18:32:12.134150664Z" level=error msg="could not resolve address of member ID 27e47f54051f679c" error="rpc error: code = 9 desc = grpc: the client connection is closing" raft_id=4392a9b88600aa2b
./ip-172-31-16-224-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:12 moby root: time="2017-06-06T18:32:12.135999311Z" level=debug msg="4392a9b88600aa2b [logterm: 0, index: 35400] rejected msgApp [logterm: 16, index: 35400] from 27e47f54051f679c"
./ip-172-31-16-224-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:12 moby root: time="2017-06-06T18:32:12.136150882Z" level=warning msg="sending message to an unrecognized member ID 27e47f54051f679c" raft_id=4392a9b88600aa2b

nathanleclaire commented 7 years ago

@goloroden the daemon logs are in /var/log/docker.log

Is it possible there's a very big node_modules or something like that in the image?

How large is it uncompressed?

nathanleclaire commented 7 years ago

@aaronlehmann here are some surrounding logs. it seems to be related to the disk space issue, since the No such image error seems to be a misnomer. The log clearly shows the pull of thenativeweb/enterjs2017@sha256:61790b4698f0b96c0df2135d3ab7e8b184f2926b1b20710782144d4a786adb23 proceeding OK -- running out of disk space is the actual issue

./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:05 moby root: time="2017-06-06T18:32:05.024090413Z" level=debug msg="pull in progress" image="thenativeweb/enterjs2017@sha256:61790b4698f0b96c0df2135d3ab7e8b184f2926b1b20710782144d4a786adb23" status="Verifying Checksum"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:05 moby root: time="2017-06-06T18:32:05.024170203Z" level=debug msg="pull in progress" image="thenativeweb/enterjs2017@sha256:61790b4698f0b96c0df2135d3ab7e8b184f2926b1b20710782144d4a786adb23" status="Download complete"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:05 moby root: time="2017-06-06T18:32:05.076379231Z" level=debug msg="pull in progress" current=1146880 image="thenativeweb/enterjs2017@sha256:61790b4698f0b96c0df2135d3ab7e8b184f2926b1b20710782144d4a786adb23" status=Extracting total=21933931
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:05 moby root: time="2017-06-06T18:32:05.613672468Z" level=debug msg="memberlist: TCP connection from=172.31.27.204:46814"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:05 moby root: time="2017-06-06T18:32:05.614595425Z" level=debug msg="ip-172-31-7-39.eu-central-1.compute.internal-847524981b35: Initiating  bulk sync for networks [sbjjvwt09dtvn1ro22yp6tngb ikhuvgep746h5l9c65wsyxv1l] with node ip-172-31-27-204.eu-central-1.compute.internal-1a89ffdbe106"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:05 moby root: time="2017-06-06T18:32:05.834535086Z" level=debug msg="pull in progress" current=5505024 image="thenativeweb/enterjs2017@sha256:61790b4698f0b96c0df2135d3ab7e8b184f2926b1b20710782144d4a786adb23" status=Extracting total=21933931
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.779238398Z" level=debug msg="Cleaning up layer 0e9f290148929587140c7493cb578fbdfe14dc94f1a7fd542d61a354b1d8dfc0: Error processing tar file(exit status 1): open /app/node_modules/babel-helper-define-map/.npmignore: no space left on device"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.799365678Z" level=error msg="Not continuing with pull after error: failed to register layer: Error processing tar file(exit status 1): open /app/node_modules/babel-helper-define-map/.npmignore: no space left on device"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.799416467Z" level=error msg="pulling image failed" error="failed to register layer: Error processing tar file(exit status 1): open /app/node_modules/babel-helper-define-map/.npmignore: no space left on device" module="node/agent/taskmanager" task.id=w3epa72doy54o99l1d3thlwkv
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.800011363Z" level=info msg="Layer sha256:fe4767e90872336f35c7321df93ef55a71dcc52f3d0facde05bb2756192e8a94 cleaned up"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout:Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.800034046Z" level=error msg="fatal task error" error="No such image: thenativeweb/enterjs2017@sha256:61790b4698f0b96c0df2135d3ab7e8b184f2926b1b20710782144d4a786adb23" module="node/agent/taskmanager" task.id=w3epa72doy54o99l1d3thlwkv
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.800079334Z" level=debug msg="state changed" module="node/agent/taskmanager" state.desired=RUNNING state.transition="PREPARING->REJECTED" task.id=w3epa72doy54o99l1d3thlwkv
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.800484344Z" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" task.id=w3epa72doy54o99l1d3thlwkv
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.801670140Z" level=debug msg="task status reported" module="node/agent"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.809494074Z" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" task.id=w3epa72doy54o99l1d3thlwkv
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:06 moby root: time="2017-06-06T18:32:06.810509354Z" level=debug msg="task status reported" module="node/agent"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:07 moby root: time="2017-06-06T18:32:07.059294250Z" level=debug msg="(*worker).Update" len(assignments)=2 module="node/agent"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:07 moby root: time="2017-06-06T18:32:07.059370900Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=0 module="node/agent"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:07 moby root: time="2017-06-06T18:32:07.059401988Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=2 module="node/agent"
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:07 moby root: time="2017-06-06T18:32:07.059431968Z" level=debug msg=assigned module="node/agent" task.desiredstate=SHUTDOWN task.id=w3epa72doy54o99l1d3thlwkv
./ip-172-31-7-39-eu-central-1-compute-internal/tail -20000 /var/log/messages.stdout-Jun  6 18:32:07 moby root: time="2017-06-06T18:32:07.059521213Z" level=debug msg=assigned module="node/agent" task.desiredstate=READY task.id=rzi6oeaeaiurcvgshcscdlbjg

nathanleclaire commented 7 years ago

something to consider: according to the log, it looks like docker creates a tmp file for image layer pulls? so it's possible that there is some duplication of a layer that pushes the disk limit over the edge, but the attempted layer download gets cleaned up later? just making some guesses as to why we're seeing this behavior even if the layer with node_modules isn't that big. i'd be curious to see docker history for the image you are trying to pull, @goloroden

time="2017-06-06T18:32:04.216188966Z" level=debug msg="Downloaded 74505baa8510 to tempfile /var/lib/docker/tmp/GetImageBlob423471451"

nathanleclaire commented 7 years ago

Could totally be wrong and there's another auth-related issue too though.

aaronlehmann commented 7 years ago

The temporary file is only kept during the pull process; however, there's a PR that would change that: https://github.com/moby/moby/pull/28348

goloroden commented 7 years ago

It's 209 MByte uncompressed.

Regarding docker history of this image, here we go:

IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
a9356c0b1010        2 days ago          /bin/sh -c #(nop)  CMD ["node" "/app/app.js"]   0 B                 
<missing>           2 days ago          /bin/sh -c #(nop) ADD dir:f6a30e9743073d06...   26.4 MB             
<missing>           2 days ago          /bin/sh -c cd /app &&     npm install --pr...   68.5 MB             
<missing>           2 days ago          /bin/sh -c #(nop) ADD file:eacc297a6875503...   1.4 kB              
<missing>           2 days ago          /bin/sh -c #(nop)  MAINTAINER the native w...   0 B                 
<missing>           8 months ago        /bin/sh -c apk add --no-cache curl make gc...   44.6 MB             
<missing>           8 months ago        /bin/sh -c #(nop)  ENV VERSION=v6.6.0 NPM_...   0 B                 
<missing>           11 months ago       /bin/sh -c #(nop) ADD file:852e9d0cb9d9065...   4.8 MB              

(Does it matter which machine I run this command on?)

goloroden commented 7 years ago

FYI, I SSHed into node 172.31.7.39 and ran df -h. Here's the result:

Filesystem                Size      Used Available Use% Mounted on
overlay                  19.7G     11.1G      7.5G  60% /
tmpfs                     1.9G         0      1.9G   0% /dev
tmpfs                     1.9G         0      1.9G   0% /sys/fs/cgroup
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/xvdb1               19.7G     11.1G      7.5G  60% /var/log
/dev/xvdb1               19.7G     11.1G      7.5G  60% /etc/ssh
tmpfs                     1.9G    153.0M      1.8G   8% /etc/passwd
tmpfs                     1.9G    153.0M      1.8G   8% /etc/group
tmpfs                     1.9G    153.0M      1.8G   8% /home/docker
/dev/xvdb1               19.7G     11.1G      7.5G  60% /etc/hosts
tmpfs                     1.9G    153.0M      1.8G   8% /etc/shadow
/dev/xvdb1               19.7G     11.1G      7.5G  60% /etc/hostname
/dev/xvdb1               19.7G     11.1G      7.5G  60% /etc/resolv.conf
tmpfs                   394.6M    708.0K    393.9M   0% /var/run/docker.sock
tmpfs                     1.9G    153.0M      1.8G   8% /usr/bin/docker
/dev/xvdb1               19.7G     11.1G      7.5G  60% /var/lib/docker/swarm/lb_name
/dev/xvdb1               19.7G     11.1G      7.5G  60% /var/lib/docker/swarm/elb.config
tmpfs                     1.9G         0      1.9G   0% /proc/kcore
tmpfs                     1.9G         0      1.9G   0% /proc/timer_list
tmpfs                     1.9G         0      1.9G   0% /proc/sched_debug
tmpfs                     1.9G         0      1.9G   0% /sys/firmware

Am I missing something, or is there in fact no disk that has run out of space? If so, why do the logs say otherwise?

aaronlehmann commented 7 years ago

Can you try df -i? The filesystem could be running out of inodes. overlay in particular is very inode-intensive.

goloroden commented 7 years ago

That's it!!!

Here is the result of df -i:

Filesystem              Inodes      Used Available Use% Mounted on
overlay                1305600   1304994       606 100% /
tmpfs                   505092        16    505076   0% /dev
tmpfs                   505092        15    505077   0% /sys/fs/cgroup
shm                     505092         1    505091   0% /dev/shm
/dev/xvdb1             1305600   1304994       606 100% /var/log
/dev/xvdb1             1305600   1304994       606 100% /etc/ssh
tmpfs                   505092      1873    503219   0% /etc/passwd
tmpfs                   505092      1873    503219   0% /etc/group
tmpfs                   505092      1873    503219   0% /home/docker
/dev/xvdb1             1305600   1304994       606 100% /etc/hosts
tmpfs                   505092      1873    503219   0% /etc/shadow
/dev/xvdb1             1305600   1304994       606 100% /etc/hostname
/dev/xvdb1             1305600   1304994       606 100% /etc/resolv.conf
tmpfs                   505092       252    504840   0% /var/run/docker.sock
tmpfs                   505092      1873    503219   0% /usr/bin/docker
/dev/xvdb1             1305600   1304994       606 100% /var/lib/docker/swarm/lb_name
/dev/xvdb1             1305600   1304994       606 100% /var/lib/docker/swarm/elb.config
tmpfs                   505092        16    505076   0% /proc/kcore
tmpfs                   505092        16    505076   0% /proc/timer_list
tmpfs                   505092        16    505076   0% /proc/sched_debug
tmpfs                   505092         1    505091   0% /sys/firmware

As we can easily see, there are several lines where it says 100% used. So I guess that this is what causes the issues, right?

kencochrane commented 7 years ago

Glad you were finally able to figure it out. Now you have two options: clean up some of the inodes, or expand the disk to give you more inodes. The easier option is to find what is using so many inodes and clean them up, if you can. Sometimes it is a bunch of files in /tmp.

Try this command to see where your inodes are being used.

$ sudo find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n
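
(In that pipeline, -xdev keeps find on the current filesystem, -type f restricts it to regular files, cut -d "/" -f 2 keeps just the top-level directory of each path, and uniq -c counts the files per directory, so the biggest consumers end up at the bottom of the numerically sorted list.)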

also, try a docker system prune to see if you can remove some of the docker items you no longer need.

goloroden commented 7 years ago

docker system prune didn't do anything (I would have been surprised if it had, as I set this up to run automatically each day in the Docker for AWS setup).

I actually just tried running the very command you suggested, but the result does not look suspicious:

/ $ sudo find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n
      1 .dockerenv
      1 entry.sh
      1 run
      3 bin
      3 sbin
      3 var
      8 lib
     61 etc
   2887 usr

Again, am I missing something?

kencochrane commented 7 years ago

@goloroden ok, I'm guessing you are running this command from ssh, which is actually inside of a docker container. You will need to run it from the host, so you can see the host file system.

Try running this command first, and then the inode one from above.

docker run -it --privileged --pid=host debian nsenter -t 1 -m -n sh

kencochrane commented 7 years ago

Sorry, if that doesn't work, try this one.

docker run --rm -it --privileged --pid=host justincormack/nsenter1 /bin/ash
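
(Both commands enter the namespaces of PID 1 on the host, so df, find, and friends then see the host's filesystem rather than the SSH container's.)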

goloroden commented 7 years ago

Ah, of course! Sorry for the dumb question…

I think that I ran this from the host, not from a container: First I SSH from my machine to the bastion host (this is a Docker container running sshd), then I SSH from there to the worker. Since these machines were set up by Docker for AWS, I don't know whether this takes me to a container or to the actual host.

Anyway, if I run the first command, it doesn't work (no space left on device… 😉).

But the second command does. If I then run the find command from above, I get the following output:

/ # find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n
      1 .ash_history
      1 init
      2 home
      4 bin
     13 containers
     25 sbin
    105 lib
    194 etc
    582 usr

Again, this does not look too bad, does it?

goloroden commented 7 years ago

If I run df -i from there, it outputs:

Filesystem              Inodes      Used Available Use% Mounted on
tmpfs                   505092      1874    503218   0% /
tmpfs                   505092       269    504823   0% /run
cgroup_root             505092        15    505077   0% /sys/fs/cgroup
dev                     497695       173    497522   0% /dev
shm                     505092         1    505091   0% /dev/shm
/dev/xvdb1             1305600   1305532        68 100% /var
tmpfs                   505092        16    505076   0% /tmp
tmpfs                   505092         6    505086   0% /Database
/dev/xvdb1             1305600   1305532        68 100% /var/lib/docker/overlay2
overlay                1305600   1305532        68 100% /var/lib/docker/overlay2/99d0fef07c67b82342b1ca79b168b4373f028c2cae907405dbbe5df7e1afcceb/merged
shm                     505092         1    505091   0% /var/lib/docker/containers/2aa3b6489f58c9114c9ef24b9b85ac113bbba515099cb2195ab70334e541c701/shm
overlay                1305600   1305532        68 100% /var/lib/docker/overlay2/2c26aa87fd987e5f9bcae375ce126eb7012e7b9f86f24ac6639715a84731f549/merged
shm                     505092         1    505091   0% /var/lib/docker/containers/05d815022f3cec7219fcd10b4071a83d982f7078ecb329644fc79d38f6177a02/shm
overlay                1305600   1305532        68 100% /var/lib/docker/overlay2/b83ae2b3126d639ce0c2e3ae65995179eaa6721df957972cc286f9b57f6dbe70/merged
shm                     505092         1    505091   0% /var/lib/docker/containers/44e0a79e6c4d0a1527c0a9e6a521571ccd56a4604ebe752d126ed7882385dc22/shm
overlay                1305600   1305532        68 100% /var/lib/docker/overlay2/4b351b79da692230e9031fb782546d94631cf2649c846bb9659580adb91152b7/merged
shm                     505092         1    505091   0% /var/lib/docker/containers/e5ea389b0f1e4b2e6da23acee74ab3d08fc6db2f5c7e3dbd46ab41e7e6733a33/shm
overlay                1305600   1305532        68 100% /var/lib/docker/overlay2/a1818ea07aed3ae0cd898db749e58d827ac91e071cca926940bfd755b0d73f46/merged
shm                     505092         1    505091   0% /var/lib/docker/containers/500dcf7de3903f4ed2696575d8ba552ccd9a661dd24075080b39aa321a8df958/shm
overlay                1305600   1305532        68 100% /var/lib/docker/overlay2/94cf2403454a33b93f84afbd4e10ecef9fb53d6ed24ef097c31b2abf94c9ce30/merged
shm                     505092         1    505091   0% /var/lib/docker/containers/42b971cd616ee25d050ee5298bbf6d7b0130c648a6baa8178d9b7d983fdacf30/shm
overlay                1305600   1305532        68 100% /var/lib/docker/overlay2/890a9778bbc346cbff1288d515a972660d70d1c326cac446750a5965d3bbd2ab/merged
shm                     505092         1    505091   0% /var/lib/docker/containers/51cac1dc0e9ed5024caf873b6220a32d11e711aa1969e040294f97cfb8e2ca2c/shm

Does this help?

kencochrane commented 7 years ago

It is definitely in the /var directory. Do your images have a ton of files in them? How about /var/log - are there a lot of log files in there? We need to find out where all of the files are coming from.

goloroden commented 7 years ago

Regarding your questions:

I then had a look at /var/lib:

/var/lib # du -a | cut -d/ -f2 | sort | uniq -c | sort -nr
1365437 docker
      2 misc
      2 dhcpcd
      1 12169968    .

So, docker seems to be worth the next look:

/var/lib/docker # du -a | cut -d/ -f2 | sort | uniq -c | sort -nr
1362748 overlay2
   2573 image
     75 containers
     15 volumes
     11 swarm
      7 network
      5 plugins
      1 trust
      1 tmp
      1 12169948    .

The overlay2 folder contains 322 folders. The largest of them has 23305 files.

This caught my eye: 23305 is extremely close to 23455. If I take into consideration that some files are not put into the container (the .git folder, e.g.), it pretty much looks as if these belong together. Maybe this is just a coincidence (I'm only guessing here), but it's interesting.

The list then continues like this:

/var/lib/docker/overlay2 # du -a | cut -d/ -f2 | sort | uniq -c | sort -nr
  23305 94cf2403454a33b93f84afbd4e10ecef9fb53d6ed24ef097c31b2abf94c9ce30
  20108 a1818ea07aed3ae0cd898db749e58d827ac91e071cca926940bfd755b0d73f46
  19176 5b92e30239e19e013e9615ae5fce2da966dfc502ced037dad64996f579dc872c
  19091 75a8c4702d2d5fbc5144d0ba5d1171782960b0e6317ca80c2898b1a8baf0baf8
  19070 c281d8cc0879e738b051e49981578ae402d6c942148a2e64944fc8473f61b09b
  19070 57b08e89ee8b776e90736111a4d80f09822634846918d70bde9f8cc821fd1ea9
  19070 23a43cc4592bccd9e126a6a81df23ac6743940437e132ce0b4d00c4b42904997
  19062 0161ac14d4e689d0a2335b91b9bef96277f0f0ac065d1775a18fb901bab832ba
  19049 a7182adbb1cc8a0487e5a8193daea9d90097da5ef370b471b9f0a9858bc42505
  19040 62a61c99f522b021d295ba04776ff2106f2582d00471f4cad785e148ce55b89f
  19025 e05dabe52feb945fdf2edf6e114ea2ce9d063d55c34402a68b605e7bff714e7a
  19025 8e30133bc4e832690493d88e3b7c29ff82697c9702c12d576514dea4a8af8de6
  19007 1b9117fa629a3b8ef4cef5f0070210247617f72c1b4fcf1eee77a8c64efc2a69
  16440 8ba87ebd88b99bbe93331b68c669e0e80ea82f0446b1f88dd64384dbe5a57fed
  16438 faf56d45f82e2edfbf4ca63c4d201944819815b8fc8f3d2cc2585fe12ef9fe96
  16438 ea371501ccfff82d468b575803fd03f651a47681d92b046ba18f882a8d5094bc
  16438 e4a95f5d062f50cca38cb3d30ac37082e35fed07b37fd145056850dd08dfcf39
  16438 d8d61a7989c7d918b9471415852d44ae22bb5b5b56d5e8f91ac607514484f174
  16438 cb9ce5adef52e107704b3b42351ff9e06db366a8d80a33025e398ab02c0d95df
[…]

Again, it's interesting to see lots of entries with the exact same number of files. This made me think: If I update an image by editing a file, the number of files doesn't change. Can it be that these directories contain the contents of each single version of each single image, and - for whatever reason - they never get cleaned up?

kencochrane commented 7 years ago

what do you get when you run docker images and docker images --all?

If you have one docker image that is based off of another, it will add a new layer for any changes, and that layer is just a diff (only what changed between the two images). So each layer shouldn't have duplicate files, unless you did something to every file (changed permissions, etc.).
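
For reference, a sketch of the commands in question (run on the affected node):

$ docker images
$ docker images --all                      # also lists intermediate layers, shown as <none>:<none>
$ docker images --filter dangling=true     # untagged images left behind by updated tags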