Closed webdock-io closed 4 days ago
I see this one on my machine too. My lxd hung and it got me really confused. I couldn't understand what happened until I saw this issue. Any fix or thoughts on this?
Just wanted to check up on this.
Any update on this?
I've reproduced this. I grabbed a stack trace from all goroutines (runtime.Stack(buf, true)) while the operation was hanging, and it looks like a deadlock:
goroutine 1744 [select, 2 minutes]:
github.com/canonical/lxd/lxd/locking.Lock({0x262dc58, 0x3cc02a0}, {0xc0020d1d70, 0x24})
/home/wesley/Workspace/lxd/lxd/locking/lock.go:64 +0x12b
github.com/canonical/lxd/lxd/instance/drivers.(*common).updateBackupFileLock(0xc001631800, {0x262dc58, 0x3cc02a0})
/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:1595 +0x125
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).Delete(0xc001631800, 0x1)
/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3669 +0x55
github.com/canonical/lxd/lxd/instance/drivers.(*common).snapshotCommon.func1()
/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:730 +0x22
github.com/canonical/lxd/shared/revert.(*Reverter).Fail(0xc003017bc8)
/home/wesley/Workspace/lxd/shared/revert/revert.go:29 +0x34
github.com/canonical/lxd/lxd/instance/drivers.(*common).snapshotCommon(0xc002838480, {0x266e3e0, 0xc002838480}, {0xc002fb6490, 0xa}, {0x18?, 0x71d7dcc1fa68?, 0x0?}, 0x0)
/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:743 +0x885
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).snapshot(0xc002838480, {0xc002fb6490, 0xa}, {0x102ad5e?, 0x0?, 0x0?}, 0x0)
/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3437 +0x3b1
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).Snapshot(0xc002838480, {0xc002fb6490, 0xa}, {0xc0013196c8?, 0xc001319788?, 0x0?}, 0x0)
/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3449 +0xca
main.instanceSnapshotsPost.func2(0xc001686410?)
/home/wesley/Workspace/lxd/lxd/instance_snapshot.go:333 +0x91
github.com/canonical/lxd/lxd/operations.(*Operation).Start.func1(0xc00099f680)
/home/wesley/Workspace/lxd/lxd/operations/operations.go:287 +0x26
created by github.com/canonical/lxd/lxd/operations.(*Operation).Start in goroutine 1709
/home/wesley/Workspace/lxd/lxd/operations/operations.go:286 +0x105
Indeed, the instance_updatebackupfile_PROJECT_INSTANCE lock is held throughout a snapshot operation. Instance Delete also acquires the lock, so when the snapshot creation fails and the snapshot is deleted, Delete is unable to acquire the lock.
Thanks Tom for making me aware of https://documentation.ubuntu.com/lxd/en/latest/server/#server-core:core.debug_address
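For anyone who wants to capture the same kind of dump, here is a minimal sketch of the runtime.Stack(buf, true) call mentioned above. The allStacks helper and the buffer-doubling loop are my own assumptions about how to call it safely, not something taken from the LXD source.

```go
package main

import (
	"fmt"
	"runtime"
)

// allStacks returns the stack traces of every goroutine in the process.
// runtime.Stack truncates its output to the buffer size, so the buffer
// is doubled until the whole dump fits.
func allStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true = include all goroutines
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	// Park one goroutine on a channel so the dump shows a blocked
	// goroutine in addition to main.
	block := make(chan struct{})
	go func() { <-block }()
	runtime.Gosched() // give the goroutine a chance to start

	fmt.Printf("%s", allStacks())
}
```

A goroutine stuck on a lock shows up in the dump with its wait reason and, once it has been blocked for a while, the duration, as in goroutine 1744 [select, 2 minutes] above.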
Ubuntu Noble LXD 5.21.1 LTS
Create an LXD container on a ZFS-backed filesystem where you've set a quota (we set the refquota flag on the pool as well, not sure if it matters), then completely fill up the disk with dd, where the profile disk is set to:
size: 450GB
And df shows
If we then do
dd if=/dev/zero of=temp.bin bs=1G count=420
And make sure df shows 0 bytes available, and on the host zfs list also shows 0 bytes available
And then do
lxc snapshot --reuse --no-expiry bigdata mysnapshot
LXD will hang forever. You should see the command just sitting there if you do ps aux. What's worse, if you kill the snapshot operation, other operations like snapshot delete will also hang. The only remedy was to do snap restart lxd, after which LXD perked up immediately; we could then free some space and redo the snapshot (which worked fine as soon as some space was free on disk).
Just snapshotting with zfs works instantly, so I suspect LXD is trying to write some data to the instance and this is what's hanging. How much free space on a zfs volume is required for a snapshot to work?