Closed dmitry-lyfar closed 11 months ago
Reproduced and LXD debug log shows this after exit
command sent:
DEBUG [2023-09-06T09:14:28+01:00] Instance process stopped PID=54554 err="<nil>" exitStatus=0 instance=c1 interactive=true project=default
DEBUG [2023-09-06T09:14:28+01:00] Exec control handler finished PID=54554 instance=c1 interactive=true project=default
And then when the ping
command is manually killed we see:
DEBUG [2023-09-06T09:15:37+01:00] Websocket: Sending barrier message address=@
DEBUG [2023-09-06T09:15:37+01:00] Websocket: Stopped read mirror address=@
DEBUG [2023-09-06T09:15:37+01:00] Websocket: Stopped write mirror address=@
DEBUG [2023-09-06T09:15:37+01:00] Exec mirror websocket finished PID=54554 instance=c1 interactive=true number=0 project=default
DEBUG [2023-09-06T09:15:37+01:00] Updated metadata for operation class=websocket description="Executing command" operation=1c628329-00c1-4f1a-b89f-a8198c6dd41c project=default
DEBUG [2023-09-06T09:15:37+01:00] Success for operation class=websocket description="Executing command" operation=1c628329-00c1-4f1a-b89f-a8198c6dd41c project=default
DEBUG [2023-09-06T09:15:37+01:00] Event listener server handler stopped listener=cdc5433d-aaaa-4c6d-9890-1cacc3f17787 local=/var/lib/lxd/unix.socket remote=@
@MusicDin @mihalicyn I've been doing a fair bit of research and experimentation around this.
The key problem is that when the shell exits with a background process that is still holding the pty open, but not producing output the any reads or polls on that pty will block.
Currently we are using io.Copy
to read from the pty and write to the websocket, and that is what is blocking.
I have tried reintroducing the approach Christian used with poll
that was removed in https://github.com/canonical/lxd/pull/11606/commits/54e3da881103c42d6b4813e8930bde1b10edb236 and it still blocked, but this time on poll
via GetPollRevents
.
I wanted to see if the approach Christian originally used with the poll
syscall had actually ever worked and whether it was just me who had inadvertently broken his implementation later on.
I found two things out:
4.0/stable
channel then the interactive exec session correctly exits even if there are background tasks still running and not generating output. So it seems at least it did used to work (but see later on for troubling news on that).4.0/edge
(still very old) it doesn't work anymore and blocks as it does today.I'm pretty sure that the breakage between 1. and 2. is caused by https://github.com/canonical/lxd/commit/06ada13bc833e1cd0003aa120aaf64a57f0974ad because rather than closing ch
channel in first go routine when GetPollRevents
returns the relevant events, it instead changes it to cancelling the context. This leaves the only place where the ch
channel is closed being in the 2nd go routine, but this is the one that blocks on GetPollRevents
.
Whoops, my bad.
However I then looked at the earlier implementation in more detail to figure out how it worked and it dawned on me that really the "working" version in 4.0/stable wasn't really working properly either. It wasn't unblocking the blocked GetPollRevents
as even after I exited the interactive session I can still see a Go routine being leaked blocked on poll
:
E.g. this is from 4.0/stable
after I've exited an interactive session successfully with background tasks that don't produce output:
goroutine 3392 [syscall]:
syscall.Syscall6(0x10f, 0xc000aa86e0, 0x1, 0x0, 0x0, 0x0, 0x0)
/snap/go/9991/src/syscall/asm_linux_amd64.s:43 +0x5
golang.org/x/sys/unix.ppoll(0xc000938000?, 0xc000aa86c0?, 0x41ace5?, 0xc000aa86a8?)
/build/lxd/parts/lxd/src/.go/pkg/mod/golang.org/x/sys@v0.0.0-20220114195835-da31bd327af9/unix/zsyscall_linux.go:124 +0x58
golang.org/x/sys/unix.Ppoll({0xc000aa86e0?, 0x1010100004f3aa0?, 0x7f48846a77b0?}, 0x0?, 0x7f48ce701a68?)
/build/lxd/parts/lxd/src/.go/pkg/mod/golang.org/x/sys@v0.0.0-20220114195835-da31bd327af9/unix/syscall_linux.go:147 +0x45
golang.org/x/sys/unix.Poll({0xc000aa86e0?, 0xc000aa86e8?, 0x45ad92?}, 0x20000?)
/build/lxd/parts/lxd/src/.go/pkg/mod/golang.org/x/sys@v0.0.0-20220114195835-da31bd327af9/unix/syscall_linux.go:156 +0x99
github.com/lxc/lxd/shared.GetPollRevents(0xc0029a7ae0?, 0xc000918000?, 0x203b?)
/build/lxd/parts/lxd/src/shared/util_linux.go:638 +0x52
github.com/lxc/lxd/shared.ExecReaderToChannel.func3()
/build/lxd/parts/lxd/src/shared/util_linux.go:517 +0x12c
created by github.com/lxc/lxd/shared.ExecReaderToChannel
/build/lxd/parts/lxd/src/shared/util_linux.go:506 +0x21a
So even when it did work (as in return from lxc exec
command) it was leaking go routines.
What I have found that works is to specify a timeout on the poll
call (I've used 1s) and so that it always returns eventually.
This then seems to simplify the function a lot as we just need to call poll
with a 1s timeout looking for (unix.POLLIN | unix.POLLPRI)
(indicating a read is ready) and then perform the read.
Then we can check if the parent process (the shell) has exited the next time a read is requested and if it has, then don't even call poll
.
My POC is here:
https://github.com/canonical/lxd/pull/12381
This doesn't leak go routines and returns within 1s of the parent process exiting if there are silent background tasks running.
Great investigation, Tom!
@mihalicyn do you see any problem with keeping poll()
ing every 1s?
@mihalicyn do you see any problem with keeping
poll()
ing every 1s?
No, I don't. I'll take a look closer on your PR.
Required information
Issue description
A shell session does not handle Ctrl+D properly and hangs if there is a background command running.
Steps to reproduce
Information to attach
dmesg
)lxc info NAME --show-log
)lxc config show NAME --expanded
)lxc monitor
while reproducing the issue)