docker-archive / for-aws


Swarm leader using all HD space #86

Closed cc250080 closed 7 years ago

cc250080 commented 7 years ago

Expected behavior

The daily cleanup job should prevent this from happening (it is activated).

Actual behavior

This is the second time my Swarm leader has filled its disk (80 GB).

/ # df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  78.7G     78.0G         0 100% /
tmpfs                     7.8G         0      7.8G   0% /dev
tmpfs                     7.8G         0      7.8G   0% /sys/fs/cgroup
tmpfs                     7.8G    161.2M      7.7G   2% /etc/shadow
/dev/xvdb1               78.7G     78.0G         0 100% /etc/ssh
tmpfs                     7.8G    161.2M      7.7G   2% /home/docker
/dev/xvdb1               78.7G     78.0G         0 100% /var/log
tmpfs                     7.8G    161.2M      7.7G   2% /etc/group
tmpfs                     7.8G    161.2M      7.7G   2% /etc/passwd
/dev/xvdb1               78.7G     78.0G         0 100% /etc/resolv.conf
/dev/xvdb1               78.7G     78.0G         0 100% /etc/hostname
/dev/xvdb1               78.7G     78.0G         0 100% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                     7.8G    161.2M      7.7G   2% /usr/bin/docker
tmpfs                     1.6G    964.0K      1.6G   0% /var/run/docker.sock
tmpfs                     7.8G         0      7.8G   0% /proc/kcore
tmpfs                     7.8G         0      7.8G   0% /proc/timer_list
tmpfs                     7.8G         0      7.8G   0% /proc/sched_debug
tmpfs                     7.8G         0      7.8G   0% /sys/firmware

Information

I am using Docker CE for AWS 17.06.0-ce (17.06.0-ce-aws2)

Steps to reproduce the behavior

Just wait a couple of weeks. Actually, is there any way to reach /var/lib/docker from inside the SSH access container?

I don't really know what is filling the leader's disk every time; pruning containers and images doesn't help, and /var/lib/logs is pretty empty.

Since this is the second time it has happened, on different versions, I am starting to get quite worried about this issue.
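For context on the two points above (the built-in cleanup and reaching /var/lib/docker from the SSH container), here is a minimal sketch of what can be run by hand. The prune command is only a rough equivalent of the scheduled cleanup job, not the exact script Docker for AWS ships, and the bind-mount trick assumes alpine:3.6 (or any small image) is available:

# Roughly what a manual cleanup looks like (removes stopped containers and unused images/networks):
docker system prune --all --force

# Inspect the host's /var/lib/docker from the SSH container via a read-only bind mount:
docker run --rm -v /var/lib/docker:/hostdocker:ro alpine:3.6 du -d 1 -h /hostdocker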

cc250080 commented 7 years ago

I wonder, could this be because I don't have CloudWatch enabled? Since I am using ELK, I thought CloudWatch would be redundant. Any ideas?

kencochrane commented 7 years ago

@cc250080 Thanks for the report. I don't think it is related to CloudWatch, but you never know; it depends on how many logs your containers produce.

Can you run the following commands, and post the results:

This will show the disk usage for the host vs shell container.

docker run -v /:/hostroot alpine:3.6 /bin/sh -c "du -shc /hostroot/*"

This will tell us what docker items are taking up space.

docker system df

If you want more info you can use the verbose flag, but it might give info you don't want to share, so feel free to not post those results.

docker system df --verbose

FrenchBen commented 7 years ago

@cc250080 Can you share the stack that you deploy when these get full?

cc250080 commented 7 years ago

Dear @kencochrane and @FrenchBen ,

Thank you very much for giving me a hand. Unfortunately, I still have the same problem.

The results of the commands that @kencochrane suggested:

From the Swarm Leader:

~ # docker run -v /:/hostroot alpine:3.6 /bin/sh -c "du -shc /hostroot/*"
Unable to find image 'alpine:3.6' locally
3.6: Pulling from library/alpine
88286f41530e: Downloading [==================================================>]  1.99MB/1.99MB
docker: write /var/lib/docker/tmp/GetImageBlob316476840: no space left on device.
See 'docker run --help'.

From a Swarm Manager:

~/docker # docker run -v /:/hostroot alpine:3.6 /bin/sh -c "du -shc /hostroot/*"
Unable to find image 'alpine:3.6' locally
3.6: Pulling from library/alpine
88286f41530e: Already exists
Digest: sha256:1072e499f3f655a032e88542330cf75b02e7bdf673278f701d7ba61629ee3ebe
Status: Downloaded newer image for alpine:3.6
16.0K   /hostroot/Database
872.0K  /hostroot/bin
11.0M   /hostroot/containers
0       /hostroot/dev
0       /hostroot/dockerimages
1.7M    /hostroot/etc
910.7M  /hostroot/home
4.0K    /hostroot/init
5.3M    /hostroot/lib
0       /hostroot/media
0       /hostroot/proc
0       /hostroot/root
1.1M    /hostroot/run
11.4M   /hostroot/sbin
0       /hostroot/srv
0       /hostroot/sys
8.0K    /hostroot/tmp
130.4M  /hostroot/usr
2.9G    /hostroot/var
3.9G    total

I am not sure how to interpret these results; where are the 80 GB that are filling the drive?

~ # docker system df
TYPE                TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images              12        11        2.021GB   756.3MB (37%)
Containers          11        11        104.4kB   0B (0%)
Local Volumes       2         2         562.4kB   0B (0%)

From the leader:

~ # docker system df --verbose
Images space usage:

REPOSITORY                  TAG               IMAGE ID       CREATED        SIZE      SHARED SIZE   UNIQUE SIZE   CONTAINERS
<none>                      <none>            a61778a03a75   13 days ago    702.9MB   642.8MB       60.08MB       1
<none>                      <none>            4be1e7a93b76   2 weeks ago    707.2MB   642.8MB       64.41MB       1
docker4x/l4controller-aws   17.06.0-ce-aws2   448783fbfa73   4 weeks ago    17.72MB   3.991MB       13.73MB       1
docker4x/guide-aws          17.06.0-ce-aws2   210c5f142417   5 weeks ago    114.1MB   3.991MB       110.1MB       1
docker4x/init-aws           17.06.0-ce-aws2   70afba89014c   5 weeks ago    113.5MB   3.991MB       109.5MB       0
<none>                      <none>            ef9655bb0d54   6 weeks ago    392.9MB   0B            392.9MB       1
docker4x/meta-aws           17.06.0-ce-aws2   9551313295bd   6 weeks ago    25.53MB   3.991MB       21.54MB       1
docker4x/shell-aws          17.06.0-ce-aws2   ce6980c25153   6 weeks ago    14.38MB   3.991MB       10.39MB       1
<none>                      <none>            87895b8ba614   2 months ago   22.61MB   0B            22.61MB       1
<none>                      <none>            3542dc1fe8b9   2 months ago   492.2MB   0B            492.2MB       1
<none>                      <none>            bb4a6b774658   4 months ago   18.91MB   0B            18.91MB       1
<none>                      <none>            f9ba08bafdea   5 months ago   57.34MB   0B            57.34MB       1

Containers space usage:

CONTAINER ID   IMAGE                                       COMMAND                  LOCAL VOLUMES   SIZE     CREATED       STATUS       NAMES
97ad809555ab   cc250080/taxonomy-be:0.3.264.94             "nohup java -Djava..."   0               32.8kB   4 days ago    Up 4 days    fairgarage_taxonomy-be.1.ajpd4dxpwwy397x77ul6a8hdt
d939791731ac   cc250080/maintenance-be:0.0.1.21            "java -Djava.secur..."   0               32.8kB   4 days ago    Up 4 days    fairgarage_maintenance-be.1.z2efyf6oag85ysklcifs27hje
d4d67533067a   google/cadvisor:latest                      "/usr/bin/cadvisor..."   0               0B       2 weeks ago   Up 2 weeks   fairgarage_cadvisor.j1ue6zm39dvzifgk6fi4gsa16.jlhl6a24uexrow0xiejatkmac
65a273de618f   prom/node-exporter:v0.14.0                  "/bin/node_exporte..."   0               0B       2 weeks ago   Up 2 weeks   fairgarage_node-exporter.j1ue6zm39dvzifgk6fi4gsa16.jay46pfqfpr1fyrq8ln8ungg8
9db842c4b406   cc250080/logstash-fg:latest                 "/docker-entrypoin..."   0               32.8kB   2 weeks ago   Up 2 weeks   fairgarage_logstash.1.ath9p0gc1ftxc62utbq669put
1b1946ed0bea   gliderlabs/logspout:latest                  "/bin/logspout sys..."   1               0B       2 weeks ago   Up 2 weeks   fairgarage_logspout.j1ue6zm39dvzifgk6fi4gsa16.p4dk6rdrv1il9f3qg8bfv1a5x
654436f182e2   kibana:5.4                                  "/docker-entrypoin..."   0               4.69kB   2 weeks ago   Up 2 weeks   fairgarage_kibana.1.lye4hyj9d7eaiywocbryukkey
0b557c125302   docker4x/l4controller-aws:17.06.0-ce-aws2   "loadbalancer run ..."   0               0B       2 weeks ago   Up 2 weeks   l4controller-aws
adb4f35ff914   docker4x/meta-aws:17.06.0-ce-aws2           "metaserver -iaas_..."   0               0B       2 weeks ago   Up 2 weeks   meta-aws
9bba59c82107   docker4x/guide-aws:17.06.0-ce-aws2          "/entry.sh"              0               998B     2 weeks ago   Up 2 weeks   guide-aws
b756e45c44a9   docker4x/shell-aws:17.06.0-ce-aws2          "/entry.sh /usr/sb..."   1               406B     2 weeks ago   Up 2 weeks   shell-aws

Local Volumes space usage:

VOLUME NAME                                                        LINKS   SIZE
sshkey                                                             1       562.4kB
d8f002ba4ffc9d0b59fcfb58e2979f4db4084a24494102154871851e704a880b   1       0B

As you can see, I suspect this is related to logs, but I have no proof and I might easily be wrong.

@FrenchBen Do you mean the CloudFormation script? I am using the current stable version with an existing VPC.

Thank you very much for your help!

Carles
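One way to test the suspicion that logs are to blame is to measure the per-container directories under /var/lib/docker/containers, which is where the json-file log driver keeps each container's *-json.log. This is only a sketch (it assumes alpine:3.6 is already present, since the full leader cannot pull new images); sizes are reported in 1 KiB blocks, and the largest entries point at the containers whose logs or writable layers are growing:

docker run --rm -v /var/lib/docker/containers:/hostcontainers:ro alpine:3.6 \
  sh -c "du -s /hostcontainers/* | sort -n | tail -n 5"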

FrenchBen commented 7 years ago

A couple more outputs. Can you run the following from your home directory:

du -d 1 -h /

cc250080 commented 7 years ago

Here is the output:

~ # du -d 1 -h /
0       /sys
31.5M   /usr
1.8M    /etc
0       /proc
12.0K   /home
216.0K  /sbin
4.0K    /tmp
8.0K    /run
8.0K    /root
1.4M    /bin
56.9M   /var
4.0K    /mnt
16.0K   /media
2.8M    /lib
0       /dev
4.0K    /srv
94.7M   /

Thanks !

Still, 'df -h' shows:

~ # df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  78.7G     78.7G         0 100% /
tmpfs                     7.8G         0      7.8G   0% /dev
tmpfs                     7.8G         0      7.8G   0% /sys/fs/cgroup
tmpfs                     7.8G    161.2M      7.7G   2% /home/docker
/dev/xvdb1               78.7G     78.7G         0 100% /etc/ssh
/dev/xvdb1               78.7G     78.7G         0 100% /var/log
tmpfs                     7.8G    161.2M      7.7G   2% /etc/group
tmpfs                     7.8G    161.2M      7.7G   2% /etc/passwd
tmpfs                     7.8G    161.2M      7.7G   2% /etc/shadow
/dev/xvdb1               78.7G     78.7G         0 100% /etc/resolv.conf
/dev/xvdb1               78.7G     78.7G         0 100% /etc/hostname
/dev/xvdb1               78.7G     78.7G         0 100% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                     1.6G      1.0M      1.6G   0% /var/run/docker.sock
tmpfs                     7.8G    161.2M      7.7G   2% /usr/bin/docker
tmpfs                     7.8G         0      7.8G   0% /proc/kcore
tmpfs                     7.8G         0      7.8G   0% /proc/timer_list
tmpfs                     7.8G         0      7.8G   0% /proc/sched_debug
tmpfs                     7.8G         0      7.8G   0% /sys/firmware

Thank you very much @FrenchBen Carles
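Worth noting: the du above walks the shell container's own root filesystem, which is why it totals only ~95M while /dev/xvdb1 is at 100%. To search the host's /var (the filesystem that is actually full) for oversized files, one option is to bind-mount the host root into a throwaway container. This is an illustration rather than an official Docker for AWS procedure, and it uses busybox find, which only understands b/c/k size suffixes (so 1048576k is roughly 1 GiB):

docker run --rm -v /:/hostroot:ro alpine:3.6 \
  find /hostroot/var -type f -size +1048576k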

FrenchBen commented 7 years ago

@cc250080 Thanks for the output. You wouldn't happen to be in our community? https://dockr.ly/community

It may be quicker to look at a few things with you there.

cc250080 commented 7 years ago

@FrenchBen I just did the sign-up, thanks also for letting me know.

In which channel should I find you or discuss topics related to Swarm and Swarm for AWS issues?

I am in #general as carles6

FrenchBen commented 7 years ago

Went through a 1-1 with @cc250080 and we determined that his Logstash setup was at fault: the container itself wasn't huge, but its log file was:

-rw-r----- 1 root root 77.1G Aug 16 07:37 e52b3f03d2a4b875b1d9d75e5f654c45ae929daa82a54c33e267cd7fb775fc71-json.log

Essentially, logging seems to be set up to write to disk (CloudWatch disabled), causing a large log file to be created for the container.
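For anyone who lands on this issue: the json-file log driver does not rotate logs unless it is told to. A minimal sketch of capping one service's log growth, using the logstash service name from this thread (the 50m/5 values are just examples, not recommendations):

docker service update \
  --log-driver json-file \
  --log-opt max-size=50m \
  --log-opt max-file=5 \
  fairgarage_logstash

Updating the service restarts its tasks, which also replaces the oversized log file. The same max-size/max-file options can be set cluster-wide via log-opts in /etc/docker/daemon.json, or per service in the stack file's logging section.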