reefland opened this issue 2 years ago
I think I am experiencing the same issue, which in my case results in the server becoming extremely unresponsive to `zfs list` commands (it takes more than 30 minutes to list all snapshots), even though everything is running entirely on SSDs. I don't know what extra info to provide for debugging this further (besides what is already present), but if there is anything I can add, please let me know.
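For anyone comparing numbers, a quick way to time the kind of listing the snapshotter depends on is a sketch like the following; the dataset name `data/containerd` is an assumption, substitute whatever root your containerd config points at:

```shell
# Time a recursive snapshot listing under the snapshotter's dataset and
# count how many snapshots exist. "data/containerd" is a placeholder --
# use the dataset from your containerd zfs snapshotter configuration.
time zfs list -t snapshot -H -o name -r data/containerd | wc -l
```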
Same here...
Seems to stumble over a missing snapshot:

```
containerd[XXX]: time="XXX" level=warning msg="snapshot garbage collection failed" error="exit status 1: \"/usr/sbin/zfs list -Hp -o name,origin,used,available,mountpoint,compression,type,volsize,quota,referenced,written,logicalused,usedbydataset data/containerd/18179\" => cannot open 'data/containerd/18179': dataset does not exist\n" snapshotter=zfs
```
I can't reproduce how containerd got into this state, but maybe the garbage collection should not stop if one snapshot is missing?
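A quick way to confirm the state described in that warning is to probe the dataset directly; the name `data/containerd/18179` is taken from the log line above, so adjust it to whatever your own error reports:

```shell
# Check whether the dataset the garbage collector is looking for still
# exists. The dataset name comes from the warning message above.
if zfs list -Hp -o name data/containerd/18179 >/dev/null 2>&1; then
  echo "dataset exists"
else
  echo "dataset missing (matches the GC failure)"
fi
```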
cc @dmcgowan @AkihiroSuda https://cloud-native.slack.com/archives/C4RJZ9Z6Y/p1730964035061469
I think I've hit the same issue. I originally created the issue under containerd, but it fits better here: https://github.com/containerd/containerd/issues/10977
Regarding the error message about the missing snapshot it is trying to delete: in my case it always tries to delete a snapshot that doesn't exist instead of the correct one. I suspect it somehow falls back to the wrong dataset name because `zfs list` takes a few minutes and it isn't pulling the right name; see the other issue for details.
Can I somehow find out why it picks the wrong dataset name? I can reproduce it with the same dataset name consistently.
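One way to start narrowing this down is to compare what containerd's metadata believes exists against what ZFS actually has. This is only a sketch: the `k8s.io` namespace and the `data/containerd` dataset root are assumptions for a typical CRI/Kubernetes setup, so adjust both for your environment:

```shell
# Datasets that actually exist under the snapshotter's root
# ("data/containerd" is a placeholder).
zfs list -r -H -o name data/containerd | sort > /tmp/zfs-datasets.txt

# Snapshots containerd's zfs snapshotter believes exist
# ("k8s.io" namespace is an assumption for a CRI setup).
ctr -n k8s.io snapshots --snapshotter zfs ls

# Compare the two lists: an entry containerd reports that has no matching
# dataset in /tmp/zfs-datasets.txt is the stale record the GC trips over.
```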
Description
I seem to be having an issue with the containerd ZFS snapshotter potentially not cleaning up snapshots.
Looking at the systemd logs, I noticed messages such as the following (not sure if related):
The containerd zfs dataset ranges between 34% and 80% capacity, depending on when the various cleanup and garbage-collection passes happen. It has never reached zero.
This number of snapshots seems high for fewer than 70 containers:
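As a rough sanity check, comparing the snapshot count against the running-container count makes "seems high" concrete. The dataset name is a placeholder, and `crictl` is assumed to be installed and pointed at containerd's CRI socket:

```shell
# Count ZFS snapshots under the snapshotter's dataset (name is a placeholder).
snap_count=$(zfs list -t snapshot -H -o name -r data/containerd | wc -l)

# Count running containers via CRI (assumes crictl is configured).
ctr_count=$(crictl ps -q | wc -l)

echo "snapshots: ${snap_count}, running containers: ${ctr_count}"
```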
Steps to reproduce the issue
1.
2.
3.
Describe the results you received and expected
I expect more snapshots to be deleted over time, but I'm unsure how to verify whether the existing snapshots are actually still valid.
What version of containerd are you using?
containerd github.com/containerd/containerd 1.5.9-0ubuntu3
Any other relevant information
Show configuration if it is related to CRI plugin.