canonical / lxd

Creating a snapshot of a ZFS backed container with 0 bytes free results in hung lxd when doing snapshot operations #13466

Closed webdock-io closed 4 days ago

webdock-io commented 2 months ago

Ubuntu Noble LXD 5.21.1 LTS

Create an LXD container on a ZFS-backed storage pool with a quota set (we also set the refquota flag on the pool, not sure if it matters), then completely fill up the disk with dd. The profile's disk device is set to:

size: 450GB

And df shows

Filesystem              Size  Used Avail Use% Mounted on
lxd/containers/bigdata  420G  xxxG   xxxM xxx% /

If we then do

dd if=/dev/zero of=temp.bin bs=1G count=420

And make sure df shows 0 bytes available, and that zfs list on the host also shows 0 bytes available

And then do

lxc snapshot --reuse --no-expiry bigdata mysnapshot

LXD will hang forever. You can see the command just sitting there in ps aux. What's worse, if you kill the snapshot operation, other operations like snapshot delete will also hang. The only remedy was to run snap restart lxd, after which LXD perked up immediately; we could then free some space and redo the snapshot (which worked fine as soon as some space was free on disk).

Snapshotting directly with zfs works instantly, so I suspect LXD is trying to write some data to the instance and this is what's hanging. How much free space on a ZFS volume is required for a snapshot to work?

capriciousduck commented 2 months ago

I see this one on my machine too. My lxd hung and it got me really confused. I couldn't understand what happened until I saw this issue. Any fix or thoughts on this?

capriciousduck commented 2 months ago

Just wanted to check up on this.

Any update on this?

MggMuggins commented 2 months ago

I've reproduced this; I grabbed a stacktrace from all goroutines (runtime.Stack(buf, true)) while the operation was hanging and it looks like a deadlock:

goroutine 1744 [select, 2 minutes]:
github.com/canonical/lxd/lxd/locking.Lock({0x262dc58, 0x3cc02a0}, {0xc0020d1d70, 0x24})
    /home/wesley/Workspace/lxd/lxd/locking/lock.go:64 +0x12b
github.com/canonical/lxd/lxd/instance/drivers.(*common).updateBackupFileLock(0xc001631800, {0x262dc58, 0x3cc02a0})
    /home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:1595 +0x125
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).Delete(0xc001631800, 0x1)
    /home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3669 +0x55
github.com/canonical/lxd/lxd/instance/drivers.(*common).snapshotCommon.func1()
    /home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:730 +0x22
github.com/canonical/lxd/shared/revert.(*Reverter).Fail(0xc003017bc8)
    /home/wesley/Workspace/lxd/shared/revert/revert.go:29 +0x34
github.com/canonical/lxd/lxd/instance/drivers.(*common).snapshotCommon(0xc002838480, {0x266e3e0, 0xc002838480}, {0xc002fb6490, 0xa}, {0x18?, 0x71d7dcc1fa68?, 0x0?}, 0x0)
    /home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:743 +0x885
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).snapshot(0xc002838480, {0xc002fb6490, 0xa}, {0x102ad5e?, 0x0?, 0x0?}, 0x0)
    /home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3437 +0x3b1
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).Snapshot(0xc002838480, {0xc002fb6490, 0xa}, {0xc0013196c8?, 0xc001319788?, 0x0?}, 0x0)
    /home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3449 +0xca
main.instanceSnapshotsPost.func2(0xc001686410?)
    /home/wesley/Workspace/lxd/lxd/instance_snapshot.go:333 +0x91
github.com/canonical/lxd/lxd/operations.(*Operation).Start.func1(0xc00099f680)
    /home/wesley/Workspace/lxd/lxd/operations/operations.go:287 +0x26
created by github.com/canonical/lxd/lxd/operations.(*Operation).Start in goroutine 1709
    /home/wesley/Workspace/lxd/lxd/operations/operations.go:286 +0x105

Indeed, the instance_updatebackupfile_PROJECT_INSTANCE lock is held throughout a snapshot operation. Instance Delete also acquires the lock, so when the snapshot creation fails and the revert path deletes the snapshot, Delete is unable to acquire the lock: the snapshot operation that triggered the revert is still holding it.
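
To make the pattern concrete, here is a minimal standalone sketch of that kind of self-deadlock. It is not LXD's code: namedLocks, snapshot, deleteSnapshot and runDemo are hypothetical names, the lock name just fills the pattern above with placeholder project and instance names, and the "no space left" failure is simulated. A timeout is added only so the demo terminates; without a deadline on the context the second Lock call would wait forever, which matches the hang seen here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// demoTimeout bounds the demo so it exits instead of hanging.
const demoTimeout = 200 * time.Millisecond

// namedLocks is a hypothetical registry of named, non-reentrant locks.
type namedLocks struct {
	mu   sync.Mutex
	held map[string]chan struct{}
}

func newNamedLocks() *namedLocks {
	return &namedLocks{held: map[string]chan struct{}{}}
}

// Lock waits until the named lock is free or ctx is done.
// On success it returns the function that releases the lock.
func (n *namedLocks) Lock(ctx context.Context, name string) (func(), error) {
	for {
		n.mu.Lock()
		waitCh, busy := n.held[name]
		if !busy {
			ch := make(chan struct{})
			n.held[name] = ch
			n.mu.Unlock()
			return func() {
				n.mu.Lock()
				delete(n.held, name)
				close(ch) // wake up anyone waiting for this lock
				n.mu.Unlock()
			}, nil
		}
		n.mu.Unlock()

		select {
		case <-waitCh: // released, try again
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}

// deleteSnapshot models a cleanup step that needs the same named lock.
func deleteSnapshot(ctx context.Context, locks *namedLocks, name string) error {
	unlock, err := locks.Lock(ctx, name)
	if err != nil {
		return fmt.Errorf("delete could not take lock %q: %w", name, err)
	}
	defer unlock()
	return nil
}

// snapshot holds the lock for the whole operation and, when creation
// fails, runs the revert (deleteSnapshot) while still holding it.
func snapshot(ctx context.Context, locks *namedLocks, name string) error {
	unlock, err := locks.Lock(ctx, name)
	if err != nil {
		return err
	}
	defer unlock()

	createErr := errors.New("no space left on device") // simulated failure

	// Revert path: the lock it needs is held by this very call.
	if err := deleteSnapshot(ctx, locks, name); err != nil {
		return fmt.Errorf("create failed (%v); revert failed: %w", createErr, err)
	}
	return createErr
}

// runDemo runs one failing snapshot under a deadline.
func runDemo() error {
	locks := newNamedLocks()
	ctx, cancel := context.WithTimeout(context.Background(), demoTimeout)
	defer cancel()
	return snapshot(ctx, locks, "instance_updatebackupfile_default_bigdata")
}

func main() {
	fmt.Println(runDemo())
}
```

In this sketch the returned error wraps context.DeadlineExceeded, because the revert can only give up once the deadline passes. The usual ways out of this shape are to release the lock before running the revert, or to give the revert a variant of the delete step that assumes the lock is already held; which one fits LXD is for the maintainers to decide.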

MggMuggins commented 2 months ago

Thanks Tom for making me aware of https://documentation.ubuntu.com/lxd/en/latest/server/#server-core:core.debug_address
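
For anyone who wants to capture a similar dump themselves, here is a small sketch of the runtime.Stack(buf, true) approach mentioned earlier in this thread. It is a standalone program, not a patch against LXD, and allStacks is a hypothetical helper name. As far as I understand, core.debug_address exposes Go's pprof debug server, which should offer a comparable goroutine dump over HTTP without modifying the daemon, but treat that as an assumption and check the linked documentation.

```go
package main

import (
	"fmt"
	"runtime"
)

// allStacks returns the stack traces of every goroutine in the process.
// runtime.Stack truncates output that does not fit, so the buffer is
// doubled until the whole dump fits.
func allStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true = include all goroutines
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	// Park a goroutine so the dump has something blocked to show.
	block := make(chan struct{})
	go func() { <-block }()
	runtime.Gosched() // give the goroutine a chance to start and block

	fmt.Printf("%s\n", allStacks())
}
```

Each goroutine in the output is listed with its state and, once it has been blocked for at least a minute, how long it has been waiting, which is how an entry like "[select, 2 minutes]" in the trace above points at the stuck lock acquisition.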