containerd / stargz-snapshotter

Fast container image distribution plugin with lazy pulling
https://github.com/containerd/containerd/issues/3731
Apache License 2.0

stargz-snapshotter uses up all available disk space #1349

Open bodgit opened 1 year ago

bodgit commented 1 year ago

I have version 0.14.3 of the snapshotter installed on some EKS nodes, some of which have been running for around 16 days. They have started to run out of disk space and it seems the majority of this is consumed by /var/lib/containerd-stargz-grpc/snapshotter/snapshots.

Is there a way to prune/clean this up automatically?

ktock commented 1 year ago

@bodgit Thanks for reporting this. Snapshots are automatically cleaned up when the image is removed. You can also manually remove images (using ctr image rm) and snapshots (using ctr snapshot rm).
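For reference, a dry-run sketch of that manual cleanup. The image reference and snapshot key below are hypothetical placeholders; on a real node you would substitute actual values and drop the echo indirection to run the commands.

```shell
# Dry-run sketch of the manual cleanup described above.
# The image ref and snapshot key are hypothetical placeholders.
ns=k8s.io
image=ghcr.io/example/app:latest   # hypothetical image reference
key=sha256:0123abc                 # hypothetical snapshot key
rm_image_cmd="ctr -n $ns image rm $image"
rm_snap_cmd="ctr -n $ns snapshot --snapshotter=stargz rm $key"
echo "$rm_image_cmd"
echo "$rm_snap_cmd"
```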

What content consumes the large space under /var/lib/containerd-stargz-grpc/ (visible via something like du -hxd 2 /var/lib/containerd-stargz-grpc/)?

bodgit commented 1 year ago

Hi @ktock

Here's the output from du -hxd 2 /var/lib/containerd-stargz-grpc/:

[root@ip-10-202-107-137 ~]# du -hxd 2 /var/lib/containerd-stargz-grpc/
34G     /var/lib/containerd-stargz-grpc/snapshotter/snapshots
34G     /var/lib/containerd-stargz-grpc/snapshotter
0       /var/lib/containerd-stargz-grpc/stargz/httpcache
0       /var/lib/containerd-stargz-grpc/stargz/fscache
0       /var/lib/containerd-stargz-grpc/stargz
34G     /var/lib/containerd-stargz-grpc/

The nodes have a 50 GB disk, 12 GB of that is consumed by /var/lib/containerd, so that and the 34 GB above accounts for most of the disk.
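The back-of-envelope accounting for those figures, as plain shell arithmetic:

```shell
# Accounting for the numbers reported above: 12 GB under /var/lib/containerd
# plus 34 GB under /var/lib/containerd-stargz-grpc on a 50 GB disk.
disk_gb=50
containerd_gb=12
stargz_gb=34
used_gb=$((containerd_gb + stargz_gb))
pct=$((used_gb * 100 / disk_gb))
echo "${used_gb} GB of ${disk_gb} GB (${pct}%) accounted for"
```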

I tried running ctr -n k8s.io snapshot ls and there are no snapshots. There are about 800 images returned by ctr -n k8s.io images ls but historically we haven't had to worry about this.

ktock commented 1 year ago

@bodgit Thanks for the info.

> 34G /var/lib/containerd-stargz-grpc/snapshotter/snapshots

What consumes the large space under this directory? Are there many snapshot directories, or is there one large snapshot directory (or file)?

> ctr -n k8s.io snapshot ls and there are no snapshots.

You need --snapshotter=stargz to get the list of snapshots (i.e. ctr-remote snapshot --snapshotter=stargz ls).

> 800 images returned by ctr -n k8s.io images ls

Are there active snapshot mounts (mount | grep stargz) on the node?

bodgit commented 1 year ago

> @bodgit Thanks for the info.
>
> 34G /var/lib/containerd-stargz-grpc/snapshotter/snapshots
>
> What consumes the large space under this directory? Are there many snapshot directories, or is there one large snapshot directory (or file)?

Lots of snapshot directories. All of them are under 1 GB, but there are roughly 600-700 of them.

> ctr -n k8s.io snapshot ls and there are no snapshots.
>
> You need --snapshotter=stargz to get the list of snapshots (i.e. ctr-remote snapshot --snapshotter=stargz ls).

Ah, that worked. Running ctr-remote -n k8s.io snapshot --snapshotter=stargz ls returns the same number of entries as there are directories above.

On this particular host, there are 612 snapshots: 117 of them are "Active" and 495 are "Committed". Some of the committed snapshots don't have a parent SHA256.

> 800 images returned by ctr -n k8s.io images ls
>
> Are there active snapshot mounts (mount | grep stargz) on the node?

That's picking up any mount that has /var/lib/containerd-stargz-grpc/snapshotter/... in its output, rather than a particular mount type? There are 109 matching entries, and all of them look similar to this:

overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/15ef6c38b6cac6dffc8dfece99257066d85ab7eb23fe8ffb1ea96fb7e33cfe92/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/14256/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/10277/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/131/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/79/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/77/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/60/fs,upperdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/17211/fs,workdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/17211/work)

They're all "overlay" mounts and they seem to vary by the number of lowerdir entries.
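That layer count can be read straight out of a mount line by extracting the lowerdir= option and counting its colon-separated entries. A sketch, using an abbreviated mount line with hypothetical paths in place of the long one above:

```shell
# Abbreviated overlay mount line with hypothetical snapshot paths.
line='overlay on /run/x/rootfs type overlay (rw,relatime,lowerdir=/s/1/fs:/s/2/fs:/s/3/fs,upperdir=/s/4/fs,workdir=/s/4/work)'
# Pull out the lowerdir= option, then count its colon-separated entries.
lowerdir=$(printf '%s' "$line" | sed -n 's/.*lowerdir=\([^,)]*\).*/\1/p')
layers=$(printf '%s' "$lowerdir" | awk -F: '{print NF}')
echo "$layers lowerdir layers"
```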

Is it a case of cleaning up the committed snapshots and keeping the active ones? The number of active snapshots (117) is roughly the same as the number of overlay mounts (109).

To be clear, we're not (yet) trying to use any stargz images, I just installed the snapshotter on the EKS nodes to make sure everything still worked as before with our existing workloads.

Everything seems to be working fine, but it's now using more disk space and it seems relative to how long the node has been running. So eventually, the node runs out of disk space and needs to be recycled, which isn't ideal.
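The 612/117/495 breakdown above can be tallied from the snapshot listing by counting the KIND column. A sketch using a fabricated three-line sample in place of real ctr-remote snapshot ls output (assumed columns: KEY PARENT KIND):

```shell
# Fabricated sample of snapshot-listing output (KEY PARENT KIND);
# "-" stands in for a missing parent SHA256.
sample='sha256:aaa sha256:bbb Committed
sha256:bbb - Committed
sha256:ccc sha256:aaa Active'
# Count each kind by matching the last column.
active=$(printf '%s\n' "$sample" | awk '$NF=="Active"{n++} END{print n+0}')
committed=$(printf '%s\n' "$sample" | awk '$NF=="Committed"{n++} END{print n+0}')
echo "active=$active committed=$committed"
```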

bodgit commented 1 year ago

I think I've found the problem. I noticed we were getting this message logged often:

kubelet: E0820 03:07:11.954394    3800 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="failed to get device for dir \"/var/lib/containerd/io.containerd.snapshotter.v1.stargz\": stat failed on /var/lib/containerd/io.containerd.snapshotter.v1.stargz with error: no such file or directory" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.stargz"

Every five minutes I was also seeing this:

kubelet: E0820 03:11:39.556258    3800 kubelet.go:1386] "Image garbage collection failed multiple times in a row" err="invalid capacity 0 on image filesystem"

On a hunch I manually created the /var/lib/containerd/io.containerd.snapshotter.v1.stargz directory. The first error message stopped repeating, and within five minutes there was a flurry of logs followed by:

kubelet: I0822 13:31:40.606893    3800 kubelet.go:1400] "Image garbage collection succeeded"

The disk usage dropped from 94% to 40%.

I've gone through the install documentation and I can't see any mention of having to create this missing directory, but its existence seems critical: without it, image garbage collection stops working. Is it just a case of manually creating it, or should it be created automatically?
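The workaround amounts to a single mkdir. A self-contained sketch (CONTAINERD_ROOT here defaults to a temp directory purely so the sketch can run anywhere; on a real node it would be /var/lib/containerd):

```shell
# Workaround from this comment: ensure the directory kubelet expects for
# the stargz snapshotter's image-filesystem stats exists.
# CONTAINERD_ROOT defaults to a temp dir here so the sketch is
# self-contained; on a real node it would be /var/lib/containerd.
root="${CONTAINERD_ROOT:-$(mktemp -d)}"
dir="$root/io.containerd.snapshotter.v1.stargz"
[ -d "$dir" ] || mkdir -p "$dir"
echo "ensured $dir exists"
```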

Here's the contents of /var/lib/containerd:

# ls -l /var/lib/containerd/
total 0
drwxr-xr-x 4 root root 33 Jul 11 14:51 io.containerd.content.v1.content
drwxr-xr-x 4 root root 41 Aug  4 14:51 io.containerd.grpc.v1.cri
drwx------ 2 root root 18 Aug  4 14:52 io.containerd.grpc.v1.introspection
drwx--x--x 2 root root 21 Jul 11 14:51 io.containerd.metadata.v1.bolt
drwx--x--x 2 root root  6 Jul 11 14:51 io.containerd.runtime.v1.linux
drwx--x--x 3 root root 20 Aug  4 14:51 io.containerd.runtime.v2.task
drwx------ 2 root root  6 Jul 11 14:51 io.containerd.snapshotter.v1.btrfs
drwx------ 3 root root 23 Jul 11 14:51 io.containerd.snapshotter.v1.native
drwx------ 3 root root 23 Jul 11 14:51 io.containerd.snapshotter.v1.overlayfs
drwx------ 2 root root  6 Aug 22 13:29 io.containerd.snapshotter.v1.stargz
drwx------ 2 root root  6 Aug 22 13:44 tmpmounts

The other *snapshotter* directories already existed and are either empty or contain only an empty snapshots directory, nothing else.

ktock commented 1 year ago

Thanks for finding the root cause and the workaround. That directory should be handled by containerd (or the CRI plugin), so I think we need to fix containerd to resolve this issue completely.

maxpain commented 1 year ago

I'm seeing the same problem. Any updates on this?

jonathanbeber commented 2 months ago

Are there any other issues where this problem is being tracked? I'm seeing the same problem.