cajbecu opened 6 years ago
Hey, mind providing a bit more information and clarification?
In the issue title, you have "mounting shm tmpfs: no space left on device", but in the body you have "error creating overlay mount to /.../merged: no space left on device". How do those two errors relate? Once you get one, do you also start getting the other, or does a machine get into a state where they both start occurring?
I assume your docker daemon is in a default configuration, but if not, the output of docker info would be helpful.
Can you reproduce this reliably, or does it just sometimes start happening, and if you can reproduce it, how?
It might be helpful to check whether the --all flag on df turns up anything interesting (though I wouldn't expect it to).
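For triaging mount-time ENOSPC errors like the ones above, a quick sketch of the usual suspects may help (these are my suggested commands, not ones from the thread; the fs.mount-max check assumes kernel >= 4.9, where exceeding the per-namespace mount cap makes mount(2) fail with ENOSPC even when the disk has free space):

```shell
# Triage "no space left on device" during a mount: it can come from
# block space, inodes, OR the kernel's per-namespace mount limit.
d=/var/lib/docker; [ -d "$d" ] || d=/   # fall back to / if the Docker root is absent
df -h "$d"                              # block space on the Docker root
df -i "$d"                              # inode usage (overlay2 creates many small files)
wc -l < /proc/self/mountinfo            # mounts visible in this namespace
# Per-namespace mount cap (fs.mount-max, default 100000, kernel >= 4.9):
cat /proc/sys/fs/mount-max 2>/dev/null || true
```

If the mount count is anywhere near fs.mount-max while df shows free blocks and inodes, the ENOSPC is almost certainly the mount table, not the disk.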
I am currently running into the same issue on multiple nodes running 1855.4.0 (stable).
Example of one of the failing pods:
kubectl -n monitoring describe po/prometheus-k8s-0
Name: prometheus-k8s-0
Namespace: monitoring
Priority: 0
PriorityClassName: <none>
Node: node-5.figo.systems/10.6.255.20
Start Time: Tue, 25 Sep 2018 13:09:51 +0200
Labels: app=prometheus
controller-revision-hash=prometheus-k8s-54ffbcbcf
prometheus=k8s
statefulset.kubernetes.io/pod-name=prometheus-k8s-0
Annotations: kubernetes.io/psp=restricted
Status: Running
IP: 10.6.68.17
Controlled By: StatefulSet/prometheus-k8s
Containers:
prometheus:
Container ID: docker://8d9f04d715569be99f97ce33369c3eaa72f698e81ee2274e5835b309600bfca6
Image: quay.io/prometheus/prometheus:v2.3.1
Image ID: docker-pullable://prom/prometheus@sha256:0283ae2509e1ccc71830edf382713cc3906aa55ca9418c50911854223829bf0b
Port: 9090/TCP
Host Port: 0/TCP
Args:
--web.console.templates=/etc/prometheus/consoles
--web.console.libraries=/etc/prometheus/console_libraries
--config.file=/etc/prometheus/config_out/prometheus.env.yaml
--storage.tsdb.path=/prometheus
--storage.tsdb.retention=30d
--web.enable-lifecycle
--storage.tsdb.no-lockfile
--web.route-prefix=/
State: Running
Started: Tue, 25 Sep 2018 13:10:15 +0200
Ready: True
Restart Count: 0
Liveness: http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
Readiness: http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120
Environment: <none>
Mounts:
/etc/prometheus/config_out from config-out (ro)
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
/prometheus from prometheus-persistent-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-th9b4 (ro)
prometheus-config-reloader:
Container ID: docker://d8a700164e0bcd95dbc30e47ea0d3e3c415b4bf023b747a44f6475cfc39ce561
Image: quay.io/coreos/prometheus-config-reloader:v0.23.0
Image ID: docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:c7229ef9fb172ad15eb096d652f37badc49acea7080328a02a052a1ee343f998
Port: <none>
Host Port: <none>
Command:
/bin/prometheus-config-reloader
Args:
--log-format=logfmt
--reload-url=http://localhost:9090/-/reload
--config-file=/etc/prometheus/config/prometheus.yaml
--config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
State: Running
Started: Tue, 25 Sep 2018 13:10:22 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 10m
memory: 50Mi
Requests:
cpu: 10m
memory: 50Mi
Environment:
POD_NAME: prometheus-k8s-0 (v1:metadata.name)
Mounts:
/etc/prometheus/config from config (rw)
/etc/prometheus/config_out from config-out (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-th9b4 (ro)
rules-configmap-reloader:
Container ID: docker://8c48612b51a367b61b85d5fad1df0faa3e0fa05398f1f5ed64aba917ce6c6377
Image: quay.io/coreos/configmap-reload:v0.0.1
Image ID: docker-pullable://quay.io/coreos/configmap-reload@sha256:e2fd60ff0ae4500a75b80ebaa30e0e7deba9ad107833e8ca53f0047c42c5a057
Port: <none>
Host Port: <none>
Args:
--webhook-url=http://localhost:9090/-/reload
--volume-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
State: Waiting
Reason: CreateContainerError
Last State: Terminated
Reason: ContainerCannotRun
Message: error creating overlay mount to /var/lib/docker/overlay2/e4bf72696e188ca75187716c4583c2f967443c185be3ccf7957ec517244f4363/merged: no space left on device
Exit Code: 128
Started: Sat, 29 Sep 2018 11:12:14 +0200
Finished: Sat, 29 Sep 2018 11:12:14 +0200
Ready: False
Restart Count: 2
Limits:
cpu: 5m
memory: 10Mi
Requests:
cpu: 5m
memory: 10Mi
Environment: <none>
Mounts:
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-th9b4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
prometheus-persistent-storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: prometheus-persistent-storage-prometheus-k8s-0
ReadOnly: false
config:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s
Optional: false
config-out:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
prometheus-k8s-rulefiles-0:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-k8s-rulefiles-0
Optional: false
prometheus-k8s-token-th9b4:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s-token-th9b4
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 13m (x7620 over 1d) kubelet, node-5.figo.systems (combined from similar events): Error: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/243f7d1eb9d0271f1368cbcc81bf82833d81b4d441d57e23e012accb8bd6978a-init/merged: no space left on device
Normal Pulled 3m (x7657 over 1d) kubelet, node-5.figo.systems Container image "quay.io/coreos/configmap-reload:v0.0.1" already present on machine
This is not unique to Kubernetes; it happens when starting docker containers locally as well, e.g.:
Oct 01 07:03:37 node-6.figo.systems docker[54405]: /run/torcx/bin/docker: Error response from daemon: mounting shm tmpfs: no space left on device.
And here is the output of docker info:
Containers: 115
Running: 56
Paused: 0
Stopped: 59
Images: 37
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: v0.13.2 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
seccomp
Profile: default
selinux
Kernel Version: 4.14.67-coreos
Operating System: Container Linux by CoreOS 1855.4.0 (Rhyolite)
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 125.9GiB
Name: node-6.figo.systems
ID: MBHE:DHJX:Q7I2:SMBS:OBGU:6UWQ:GA3S:AJPY:DTY7:JJKY:Q3MG:A2ZT
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
This issue might be related: https://github.com/moby/moby/issues/29638. On CoreOS the error is different though.
Some potentially helpful info:
core@node-6 ~ $ cat /proc/self/mountinfo | wc -l
25185
core@node-6 ~ $ mount | wc -l
25185
core@node-6 ~ $ docker ps -a | wc -l
117
core@node-6 ~ $ cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 11 91 1
cpu 2 145 1
cpuacct 2 145 1
blkio 10 145 1
memory 5 1389 1
devices 4 145 1
freezer 9 91 1
net_cls 3 91 1
perf_event 7 91 1
net_prio 3 91 1
hugetlb 8 91 1
pids 6 156 1
core@node-6 ~ $ find /sys/fs/cgroup/memory/docker -type d | wc -l
3
core@node-6 ~ $ find /sys/fs/cgroup/memory -type d ! -path '/sys/fs/cgroup/memory/docker*' | wc -l
152
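With ~25k entries in the mount table, it may be worth breaking the list down to see what is actually accumulating. A sketch (my own one-liner, not from this thread; it assumes the usual mountinfo layout where the last three fields after the "-" separator are fstype, source, and super options):

```shell
# Count mounts by filesystem type to spot the leak (overlay? tmpfs? shm?).
# $(NF-2) is the fstype field in /proc/self/mountinfo on typical lines.
awk '{print $(NF-2)}' /proc/self/mountinfo | sort | uniq -c | sort -rn | head
```

A count dominated by overlay or tmpfs entries under /var/lib/docker would point at leaked container mounts rather than anything the host itself mounted.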
The root cause is that the resource is busy, according to journalctl -eu docker:
...
Oct 01 08:03:12 node-6.figo.systems env[1837]: time="2018-10-01T08:03:12.903193735Z" level=error msg="Error removing mounted layer 5f18712935197fb61c86479ab40c15e7af7b3d1e652d2fe25a841d318d61cb11: remove /var/lib/docker/overlay2/6ee046bcb95387f41a58d576498f46fe8541830929cd09148e464d14786666c4/merged: device or resource busy"
Oct 01 08:03:12 node-6.figo.systems env[1837]: time="2018-10-01T08:03:12.903334746Z" level=error msg="Handler for DELETE /v1.31/containers/5f18712935197fb61c86479ab40c15e7af7b3d1e652d2fe25a841d318d61cb11 returned error: container 5f18712935197fb61c86479ab40c15e7af7b3d1e652d2fe25a841d318d61cb11: driver \"overlay2\" failed to remove root filesystem: remove /var/lib/docker/overlay2/6ee046bcb95387f41a58d576498f46fe8541830929cd09148e464d14786666c4/merged: device or resource busy"
Oct 01 08:03:25 node-6.figo.systems env[1837]: time="2018-10-01T08:03:25.779904862Z" level=error msg="Error removing mounted layer 0c4754d56f5aa974b2b3b4b0036ec1d10ebcd9573f63ee6eb7eb6b1d82e24021: remove /var/lib/docker/overlay2/b61f86d32867829c8a458ff5a1f56b09748661aae7571dd75dbdb504e92bb1c7/merged: device or resource busy"
Oct 01 08:03:25 node-6.figo.systems env[1837]: time="2018-10-01T08:03:25.780025983Z" level=error msg="Handler for DELETE /v1.31/containers/0c4754d56f5aa974b2b3b4b0036ec1d10ebcd9573f63ee6eb7eb6b1d82e24021 returned error: container 0c4754d56f5aa974b2b3b4b0036ec1d10ebcd9573f63ee6eb7eb6b1d82e24021: driver \"overlay2\" failed to remove root filesystem: remove /var/lib/docker/overlay2/b61f86d32867829c8a458ff5a1f56b09748661aae7571dd75dbdb504e92bb1c7/merged: device or resource busy"
Oct 01 08:03:38 node-6.figo.systems env[1837]: time="2018-10-01T08:03:38.593573706Z" level=error msg="Error removing mounted layer 644b6b8cd83eca92f6179de18bd1874feab626c0b347d167a74b5d3053d4f521: remove /var/lib/docker/overlay2/70e5657e93dbdf85b57f2f262797fd9387b6cc5b09ae4e333310aa1f7c1d4be1/merged: device or resource busy"
Oct 01 08:03:38 node-6.figo.systems env[1837]: time="2018-10-01T08:03:38.593700366Z" level=error msg="Handler for DELETE /v1.31/containers/644b6b8cd83eca92f6179de18bd1874feab626c0b347d167a74b5d3053d4f521 returned error: container 644b6b8cd83eca92f6179de18bd1874feab626c0b347d167a74b5d3053d4f521: driver \"overlay2\" failed to remove root filesystem: remove /var/lib/docker/overlay2/70e5657e93dbdf85b57f2f262797fd9387b6cc5b09ae4e333310aa1f7c1d4be1/merged: device or resource busy"
Oct 01 08:03:51 node-6.figo.systems env[1837]: time="2018-10-01T08:03:51.420397227Z" level=error msg="Error removing mounted layer 245b9fc2b3035d47a876934e9e2f01b147dc8bb14fd760a94fd7350faab00d2e: remove /var/lib/docker/overlay2/ef8040a41b4441fbbd2fa9557ff7a04a5fa784aeb9b150d0cfc97b9d7e95600d/merged: device or resource busy"
Oct 01 08:03:51 node-6.figo.systems env[1837]: time="2018-10-01T08:03:51.420562652Z" level=error msg="Handler for DELETE /v1.31/containers/245b9fc2b3035d47a876934e9e2f01b147dc8bb14fd760a94fd7350faab00d2e returned error: container 245b9fc2b3035d47a876934e9e2f01b147dc8bb14fd760a94fd7350faab00d2e: driver \"overlay2\" failed to remove root filesystem: remove /var/lib/docker/overlay2/ef8040a41b4441fbbd2fa9557ff7a04a5fa784aeb9b150d0cfc97b9d7e95600d/merged: device or resource busy"
Oct 01 08:04:04 node-6.figo.systems env[1837]: time="2018-10-01T08:04:04.206438606Z" level=error msg="Error removing mounted layer 64e8f782b15e5fe8d7010ac0da9b17ab73c7a352d2c0c1f3252124541c3a7543: remove /var/lib/docker/overlay2/c81399a149123655c1d477adb32ed8d2d0d1fb365b99664bc1603eed8b39042a/merged: device or resource busy"
Oct 01 08:04:04 node-6.figo.systems env[1837]: time="2018-10-01T08:04:04.206566990Z" level=error msg="Handler for DELETE /v1.31/containers/64e8f782b15e5fe8d7010ac0da9b17ab73c7a352d2c0c1f3252124541c3a7543 returned error: container 64e8f782b15e5fe8d7010ac0da9b17ab73c7a352d2c0c1f3252124541c3a7543: driver \"overlay2\" failed to remove root filesystem: remove /var/lib/docker/overlay2/c81399a149123655c1d477adb32ed8d2d0d1fb365b99664bc1603eed8b39042a/merged: device or resource busy"
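One way to chase a "device or resource busy" like the ones above is to scan every process's mount namespace for the layer path from the log; a hit means that process (often through a private mount namespace) is pinning the mount. A sketch, using a layer path taken from the journal output above:

```shell
# Find processes whose mount namespace still references a given overlay dir.
TARGET=/var/lib/docker/overlay2/6ee046bcb95387f41a58d576498f46fe8541830929cd09148e464d14786666c4/merged
for p in /proc/[0-9]*; do
  if grep -qF "$TARGET" "$p/mountinfo" 2>/dev/null; then
    printf '%s\t%s\n' "${p#/proc/}" "$(tr '\0' ' ' < "$p/cmdline")"
  fi
done
```

Any PID this prints is a candidate for the holder Docker is fighting with when it tries to remove the layer.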
Also interesting: we never ran into this issue on 1800.7.0, so this might be a regression caused by https://github.com/coreos/bugs/issues/2497. That said, we are on a fairly young POC cluster, so we might simply not have noticed it before.
We found the issue: fluent-bit was still holding a handle to the logs (this applies to fluentd as well).
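For anyone hitting the same thing, one way to confirm a log-tailer is the culprit (a sketch of my own, not the exact commands used; pure /proc, since lsof is not installed on Container Linux by default) is to list descriptors still open on deleted files under the Docker root:

```shell
# Processes holding open fds to deleted files under /var/lib/docker --
# a log shipper tailing rotated container logs is a common culprit.
for fd in /proc/[0-9]*/fd/*; do
  tgt=$(readlink "$fd" 2>/dev/null) || continue
  case "$tgt" in
    /var/lib/docker/*'(deleted)'*) echo "$fd -> $tgt" ;;
  esac
done
```

Restarting whatever process shows up here (fluent-bit in our case) released the handles and let Docker remove the layers.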
@trevex I'm not sure this is 100% related, since this thread is a little older, but I'm running into errors that closely resemble it. When I run df -h I also see thousands of mount points. I'm running Mesosphere DC/OS on CoreOS on Google Cloud, and I'm seeing this error on all of the container-bearing agents:
journalctl -eu docker:
May 08 19:57:28 staging-aus-privateagent2 env[1929]: time="2019-05-08T19:57:28.732640790Z" level=error msg="error unmounting /var/lib/docker/overlay2/be4bdd0fe9a8e117d183b3ca42754feee5daabf78d25e70db48a88260e275d18-init/merged: invalid argument" storage-driver=overlay2
May 08 19:57:30 staging-aus-privateagent2 env[1929]: time="2019-05-08T19:57:30.083451553Z" level=error msg="Handler for POST /v1.38/containers/create returned error: error creating overlay mount to /var/lib/docker/overlay2/be4bdd0fe9a8e117d183b3ca42754feee5daabf78d25e70db48a88260e275d18-init/merged: no space left on device"
$ cat /proc/self/mountinfo | wc -l
49990
$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2023.5.0
VERSION_ID=2023.5.0
BUILD_ID=2019-03-09-0138
PRETTY_NAME="Container Linux by CoreOS 2023.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
Issue Report
Bug
/run/torcx/bin/docker: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/7c0178cc475ec5e28183458635a155aaedc53766f345a2018960fdb90e769d23-init/merged: no space left on device. See '/run/torcx/bin/docker run --help'.
Container Linux Version
k8s-node-12 ~ # df -h
k8s-node-12 ~ # df -ih