Closed paride closed 1 year ago
Full pprof trace: https://paste.ubuntu.com/p/7WPjrfW4KX/
There is a goroutine blocked in the digitalocean go-qemu qmp package:
goroutine 734540 [IO wait]:
internal/poll.runtime_pollWait(0x7fed2479e3f0, 0x72)
/snap/go/current/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc000c24c00?, 0xc000ca00b3?, 0x0)
/snap/go/current/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/snap/go/current/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000c24c00, {0xc000ca00b3, 0xf4d, 0xf4d})
/snap/go/current/src/internal/poll/fd_unix.go:167 +0x299
net.(*netFD).Read(0xc000c24c00, {0xc000ca00b3?, 0x0?, 0x0?})
/snap/go/current/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc00058e6f8, {0xc000ca00b3?, 0xc001addda8?, 0x411b31?})
/snap/go/current/src/net/net.go:183 +0x45
bufio.(*Scanner).Scan(0xc001addf08)
/snap/go/current/src/bufio/scan.go:214 +0x876
github.com/digitalocean/go-qemu/qmp.(*SocketMonitor).listen(0xc000cb2280, {0x20e1880?, 0xc00058e6f8?}, 0xc000c7eb40, 0xc000c7eba0)
/build/lxd/parts/lxd/src/.go/pkg/mod/github.com/digitalocean/go-qemu@v0.0.0-20221209210016-f035778c97f7/qmp/socket.go:175 +0x10b
created by github.com/digitalocean/go-qemu/qmp.(*SocketMonitor).Connect
/build/lxd/parts/lxd/src/.go/pkg/mod/github.com/digitalocean/go-qemu@v0.0.0-20221209210016-f035778c97f7/qmp/socket.go:151 +0x36a
OK so this could be caused by a bug in the digitalocean/go-qemu package.
So LXD calls qmp.NewSocketMonitor(), which succeeds: https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/qmp/monitor.go#L227
Then LXD runs Connect() in a separate goroutine and waits for a response: https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/qmp/monitor.go#L232-L236
If Connect() hasn't returned after 5s, LXD calls Disconnect(): https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/qmp/monitor.go#L244-L247
And I can see a goroutine blocked in Disconnect() for a long time:
goroutine 2524696 [chan receive (nil chan), 5400 minutes]:
github.com/digitalocean/go-qemu/qmp.(*SocketMonitor).Disconnect(...)
/build/lxd/parts/lxd/src/.go/pkg/mod/github.com/digitalocean/go-qemu@v0.0.0-20221209210016-f035778c97f7/qmp/socket.go:107
github.com/lxc/lxd/lxd/instance/drivers/qmp.Connect({0xc0005b4730, 0x48}, {0x1d998f9, 0x13}, 0xc000f3f320)
/build/lxd/parts/lxd/src/lxd/instance/drivers/qmp/monitor.go:245 +0x64e
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).statusCode(0xc000d4a600)
/build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7345 +0x14b
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).Render(0xc000d4a600, {0x0, 0x0, 0x1f?})
/build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7007 +0x3f2
github.com/lxc/lxd/lxd/instance/drivers.(*qemu).RenderFull(0xc000d4a600, {0xc001123590?, 0x137eb?, 0x20?})
/build/lxd/parts/lxd/src/lxd/instance/drivers/driver_qemu.go:7047 +0x4a
main.doInstancesGet.func5()
/build/lxd/parts/lxd/src/lxd/instances_get.go:460 +0x25f
created by main.doInstancesGet
/build/lxd/parts/lxd/src/lxd/instances_get.go:437 +0x1085
Looking at the go-qemu package I can see a potential bug in the way Connect() and Disconnect() work:
We can see that Disconnect() is blocked on this line:
https://github.com/digitalocean/go-qemu/blob/master/qmp/socket.go#L107
And we can see that Connect() only populates mon.stream if it gets a valid response from the QMP socket.
https://github.com/digitalocean/go-qemu/blob/master/qmp/socket.go#L154
So calling Disconnect() before Connect() has initialised mon.stream will block forever (a receive from a nil channel blocks forever).
So I think we need to contribute a fix upstream to ensure that mon.stream is initialised in NewSocketMonitor, and closed if Connect() fails.
I also think it would be good to add a ConnectCtx() function to that package so we can pass a cancellation context to the connect step.
Waiting on upstream to merge https://github.com/digitalocean/go-qemu/pull/201
Required information
Issue description
On a machine making heavy use of LXD it sometimes happens that many LXD commands become unresponsive, e.g. lxc list or lxc info <some-vm-instance>. With guidance from @tomponline I enabled pprof, and apparently LXD gets stuck waiting for information from the QMP socket.
Full pprof data: https://paste.ubuntu.com/p/3FwhpRtH7H/
Killing the lxd process (kill -9 <pid of lxd>) and waiting for snapd to restart it "fixes" the issue, without manually killing any qemu process.
Steps to reproduce
Not easy to reproduce. Happens randomly every few days on a bare metal machine running Jammy and using LXD (snap) to start/destroy several LXD containers and VMs per day.
Information to attach
- dmesg: I don't see anything LXD related, in any case: https://paste.ubuntu.com/p/zRGzXqCPbp/
- lxc info NAME --show-log: the command hangs.
- lxc config show NAME --expanded:
- lxc monitor (while reproducing the issue):
- lxd info: