bodgit opened this issue 1 year ago
@bodgit Thanks for reporting this. Snapshots are automatically cleaned up when the image is removed. You can also manually remove images (using `ctr image rm`) and snapshots (using `ctr snapshot rm`).

What is consuming the large space under `/var/lib/containerd-stargz-grpc/` (visible with something like `du -hxd 2 /var/lib/containerd-stargz-grpc/`)?
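For reference, a minimal sketch of that manual cleanup (the image reference below is a hypothetical example, and `-n k8s.io` assumes a Kubernetes node):

```sh
# Inspect what is taking up space under the snapshotter's root
du -hxd 2 /var/lib/containerd-stargz-grpc/

# Remove an image; its snapshots are garbage-collected along with it
ctr -n k8s.io image ls
ctr -n k8s.io image rm docker.io/library/nginx:latest   # hypothetical example reference

# Or remove an individual snapshot by key
ctr -n k8s.io snapshot --snapshotter=stargz ls
ctr -n k8s.io snapshot --snapshotter=stargz rm <snapshot-key>
```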
Hi @ktock, here's the output from `du -hxd 2 /var/lib/containerd-stargz-grpc/`:
[root@ip-10-202-107-137 ~]# du -hxd 2 /var/lib/containerd-stargz-grpc/
34G /var/lib/containerd-stargz-grpc/snapshotter/snapshots
34G /var/lib/containerd-stargz-grpc/snapshotter
0 /var/lib/containerd-stargz-grpc/stargz/httpcache
0 /var/lib/containerd-stargz-grpc/stargz/fscache
0 /var/lib/containerd-stargz-grpc/stargz
34G /var/lib/containerd-stargz-grpc/
The nodes have a 50 GB disk, 12 GB of which is consumed by `/var/lib/containerd`, so that plus the 34 GB above accounts for most of the disk.
I tried running `ctr -n k8s.io snapshot ls` and there are no snapshots. There are about 800 images returned by `ctr -n k8s.io images ls`, but historically we haven't had to worry about this.
@bodgit Thanks for the info.

> 34G /var/lib/containerd-stargz-grpc/snapshotter/snapshots

What is consuming the large space under this directory? Are there many snapshot dirs, or is there one large snapshot dir (or file)?

> ctr -n k8s.io snapshot ls and there are no snapshots.

You need `--snapshotter=stargz` to get the list of snapshots (i.e. `ctr-remote snapshot --snapshotter=stargz ls`).
> 800 images returned by ctr -n k8s.io images ls

Are there active snapshot mounts (`mount | grep stargz`) on the node?
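Concretely, something like this (`ctr-remote` ships with the stargz snapshotter):

```sh
# Snapshots as seen by the stargz snapshotter (not the default overlayfs one)
ctr-remote -n k8s.io snapshot --snapshotter=stargz ls

# Live mounts referencing the snapshotter's directories
mount | grep stargz
```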
> @bodgit Thanks for the info.
>
> 34G /var/lib/containerd-stargz-grpc/snapshotter/snapshots
>
> What is consuming the large space under this directory? Are there many snapshot dirs, or is there one large snapshot dir (or file)?

Lots of snapshot directories. All of them are under 1 GB, but there are about 600-700 of them.
> ctr -n k8s.io snapshot ls and there are no snapshots.
>
> You need `--snapshotter=stargz` to get the list of snapshots (i.e. `ctr-remote snapshot --snapshotter=stargz ls`).

Ah, that worked. Running `ctr-remote -n k8s.io snapshot --snapshotter=stargz ls` returns the same number of entries as there are directories above.

On this particular host there are 612 snapshots: 117 of them are "Active" and 495 are "Committed". Some of the committed snapshots don't have a parent SHA256.
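For reference, the counts above came from something like:

```sh
ctr-remote -n k8s.io snapshot --snapshotter=stargz ls | grep -c Active      # 117 on this host
ctr-remote -n k8s.io snapshot --snapshotter=stargz ls | grep -c Committed   # 495 on this host
```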
> 800 images returned by ctr -n k8s.io images ls
>
> Are there active snapshot mounts (`mount | grep stargz`) on the node?

That's picking up any mount whose options mention `/var/lib/containerd-stargz-grpc/snapshotter/...` rather than a particular mount type? There are 109 matching entries, all similar to this:
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/15ef6c38b6cac6dffc8dfece99257066d85ab7eb23fe8ffb1ea96fb7e33cfe92/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/14256/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/10277/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/131/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/79/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/77/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/60/fs,upperdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/17211/fs,workdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/17211/work)
They're all "overlay" mounts and they vary only in the number of `lowerdir` entries.

Is it a case of cleaning up the committed snapshots and keeping the active ones, given that the number of active snapshots (117) is roughly the same as the number of overlay mounts (109)?
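One way to cross-check, sketched from the mount output above (the temp-file paths are arbitrary):

```sh
# Snapshot IDs referenced by live overlay mounts (lowerdir/upperdir/workdir)
mount | grep -o '/var/lib/containerd-stargz-grpc/snapshotter/snapshots/[0-9]*' \
  | awk -F/ '{print $NF}' | sort -u > /tmp/mounted

# Snapshot directories present on disk
ls /var/lib/containerd-stargz-grpc/snapshotter/snapshots | sort -u > /tmp/ondisk

# IDs on disk but not referenced by any live mount
comm -13 /tmp/mounted /tmp/ondisk
```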
To be clear, we're not (yet) trying to use any stargz images; I just installed the snapshotter on the EKS nodes to make sure everything still worked as before with our existing workloads.

Everything seems to be working fine, but the nodes are now using more disk space, roughly in proportion to how long they've been running. Eventually a node runs out of disk space and needs to be recycled, which isn't ideal.
I think I've found the problem. I noticed we were getting this message logged often:
kubelet: E0820 03:07:11.954394 3800 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="failed to get device for dir \"/var/lib/containerd/io.containerd.snapshotter.v1.stargz\": stat failed on /var/lib/containerd/io.containerd.snapshotter.v1.stargz with error: no such file or directory" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.stargz"
Every five minutes I was also seeing this:
kubelet: E0820 03:11:39.556258 3800 kubelet.go:1386] "Image garbage collection failed multiple times in a row" err="invalid capacity 0 on image filesystem"
On a hunch I manually created the `/var/lib/containerd/io.containerd.snapshotter.v1.stargz` directory and the first error message stopped repeating. Within five minutes there was a flurry of logs and then I saw:
kubelet: I0822 13:31:40.606893 3800 kubelet.go:1400] "Image garbage collection succeeded"
The disk usage had gone from 94% down to 40%.
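For anyone else hitting this, the workaround was just creating the directory named in the kubelet error (tailing the logs via journald is an assumption; adjust for your logging setup):

```sh
# Create the mountpoint kubelet's image-GC stats provider expects
mkdir -p /var/lib/containerd/io.containerd.snapshotter.v1.stargz

# Within the next ~5-minute GC cycle, kubelet should log
# "Image garbage collection succeeded"
journalctl -u kubelet -f | grep -i 'image garbage collection'
```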
I've gone through the install documentation and I can't see any mention of having to create this missing directory, but it seems critical that it exists, otherwise image garbage collection stops working. Is it just a case of manually creating it, or should it be created automatically?
Here's the contents of `/var/lib/containerd`:
# ls -l /var/lib/containerd/
total 0
drwxr-xr-x 4 root root 33 Jul 11 14:51 io.containerd.content.v1.content
drwxr-xr-x 4 root root 41 Aug 4 14:51 io.containerd.grpc.v1.cri
drwx------ 2 root root 18 Aug 4 14:52 io.containerd.grpc.v1.introspection
drwx--x--x 2 root root 21 Jul 11 14:51 io.containerd.metadata.v1.bolt
drwx--x--x 2 root root 6 Jul 11 14:51 io.containerd.runtime.v1.linux
drwx--x--x 3 root root 20 Aug 4 14:51 io.containerd.runtime.v2.task
drwx------ 2 root root 6 Jul 11 14:51 io.containerd.snapshotter.v1.btrfs
drwx------ 3 root root 23 Jul 11 14:51 io.containerd.snapshotter.v1.native
drwx------ 3 root root 23 Jul 11 14:51 io.containerd.snapshotter.v1.overlayfs
drwx------ 2 root root 6 Aug 22 13:29 io.containerd.snapshotter.v1.stargz
drwx------ 2 root root 6 Aug 22 13:44 tmpmounts
The other `*snapshotter*` directories already existed and are either empty or just have an empty `snapshots` directory within them, nothing else.
Thanks for finding the root cause and the workaround. That directory should be handled by containerd (or the CRI plugin), so I think we need to fix containerd to completely resolve this issue.
Same problem here. Any updates on this?
Are there any other issues where this problem is being tracked? I'm seeing the same thing:

> I have version 0.14.3 of the snapshotter installed on some EKS nodes, some of which have been running for around 16 days. They have started to run out of disk space and it seems the majority of this is consumed by /var/lib/containerd-stargz-grpc/snapshotter/snapshots. Is there a way to prune/clean this up automatically?