canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0
4.32k stars 926 forks source link

Interactive shell session hangs on exit if a background command is running #12210

Closed dmitry-lyfar closed 11 months ago

dmitry-lyfar commented 1 year ago

Required information

Issue description

A shell session does not handle Ctrl+D properly and hangs if there is a background command running.

Steps to reproduce

  1. lxc launch ubuntu:20.04 u1
  2. lxc shell u1
  3. ping google.com 2>&1 > /dev/null &
  4. exit OR Ctrl-D will hang and not return to the host's shell session. If a separate shell session is launched and the process running in p.3 is killed, the original session terminates immediately

Information to attach

tomponline commented 1 year ago

Reproduced and LXD debug log shows this after exit command sent:

DEBUG  [2023-09-06T09:14:28+01:00] Instance process stopped                      PID=54554 err="<nil>" exitStatus=0 instance=c1 interactive=true project=default
DEBUG  [2023-09-06T09:14:28+01:00] Exec control handler finished                 PID=54554 instance=c1 interactive=true project=default

And then when the ping command is manually killed we see:

DEBUG  [2023-09-06T09:15:37+01:00] Websocket: Sending barrier message            address=@
DEBUG  [2023-09-06T09:15:37+01:00] Websocket: Stopped read mirror                address=@
DEBUG  [2023-09-06T09:15:37+01:00] Websocket: Stopped write mirror               address=@
DEBUG  [2023-09-06T09:15:37+01:00] Exec mirror websocket finished                PID=54554 instance=c1 interactive=true number=0 project=default
DEBUG  [2023-09-06T09:15:37+01:00] Updated metadata for operation                class=websocket description="Executing command" operation=1c628329-00c1-4f1a-b89f-a8198c6dd41c project=default
DEBUG  [2023-09-06T09:15:37+01:00] Success for operation                         class=websocket description="Executing command" operation=1c628329-00c1-4f1a-b89f-a8198c6dd41c project=default
DEBUG  [2023-09-06T09:15:37+01:00] Event listener server handler stopped         listener=cdc5433d-aaaa-4c6d-9890-1cacc3f17787 local=/var/lib/lxd/unix.socket remote=@
tomponline commented 11 months ago

@MusicDin @mihalicyn I've been doing a fair bit of research and experimentation around this.

The key problem is that when the shell exits with a background process that is still holding the pty open, but not producing output the any reads or polls on that pty will block.

Currently we are using io.Copy to read from the pty and write to the websocket, and that is what is blocking.

I have tried reintroducing the approach Christian used with poll that was removed in https://github.com/canonical/lxd/pull/11606/commits/54e3da881103c42d6b4813e8930bde1b10edb236 and it still blocked, but this time on poll via GetPollRevents.

I wanted to see if the approach Christian originally used with the poll syscall had actually ever worked and whether it was just me who had inadvertently broken his implementation later on.

I found two things out:

  1. If you use 4.0/stable channel then the interactive exec session correctly exits even if there are background tasks still running and not generating output. So it seems at least it did used to work (but see later on for troubling news on that).
  2. If you use 4.0/edge (still very old) it doesn't work anymore and blocks as it does today.

I'm pretty sure that the breakage between 1. and 2. is caused by https://github.com/canonical/lxd/commit/06ada13bc833e1cd0003aa120aaf64a57f0974ad because rather than closing ch channel in first go routine when GetPollRevents returns the relevant events, it instead changes it to cancelling the context. This leaves the only place where the ch channel is closed being in the 2nd go routine, but this is the one that blocks on GetPollRevents.

Whoops, my bad.

However I then looked at the earlier implementation in more detail to figure out how it worked and it dawned on me that really the "working" version in 4.0/stable wasn't really working properly either. It wasn't unblocking the blocked GetPollRevents as even after I exited the interactive session I can still see a Go routine being leaked blocked on poll:

E.g. this is from 4.0/stable after I've exited an interactive session successfully with background tasks that don't produce output:

goroutine 3392 [syscall]:
syscall.Syscall6(0x10f, 0xc000aa86e0, 0x1, 0x0, 0x0, 0x0, 0x0)
    /snap/go/9991/src/syscall/asm_linux_amd64.s:43 +0x5
golang.org/x/sys/unix.ppoll(0xc000938000?, 0xc000aa86c0?, 0x41ace5?, 0xc000aa86a8?)
    /build/lxd/parts/lxd/src/.go/pkg/mod/golang.org/x/sys@v0.0.0-20220114195835-da31bd327af9/unix/zsyscall_linux.go:124 +0x58
golang.org/x/sys/unix.Ppoll({0xc000aa86e0?, 0x1010100004f3aa0?, 0x7f48846a77b0?}, 0x0?, 0x7f48ce701a68?)
    /build/lxd/parts/lxd/src/.go/pkg/mod/golang.org/x/sys@v0.0.0-20220114195835-da31bd327af9/unix/syscall_linux.go:147 +0x45
golang.org/x/sys/unix.Poll({0xc000aa86e0?, 0xc000aa86e8?, 0x45ad92?}, 0x20000?)
    /build/lxd/parts/lxd/src/.go/pkg/mod/golang.org/x/sys@v0.0.0-20220114195835-da31bd327af9/unix/syscall_linux.go:156 +0x99
github.com/lxc/lxd/shared.GetPollRevents(0xc0029a7ae0?, 0xc000918000?, 0x203b?)
    /build/lxd/parts/lxd/src/shared/util_linux.go:638 +0x52
github.com/lxc/lxd/shared.ExecReaderToChannel.func3()
    /build/lxd/parts/lxd/src/shared/util_linux.go:517 +0x12c
created by github.com/lxc/lxd/shared.ExecReaderToChannel
    /build/lxd/parts/lxd/src/shared/util_linux.go:506 +0x21a

So even when it did work (as in return from lxc exec command) it was leaking go routines.

What I have found that works is to specify a timeout on the poll call (I've used 1s) and so that it always returns eventually. This then seems to simplify the function a lot as we just need to call poll with a 1s timeout looking for (unix.POLLIN | unix.POLLPRI) (indicating a read is ready) and then perform the read.

Then we can check if the parent process (the shell) has exited the next time a read is requested and if it has, then don't even call poll.

My POC is here:

https://github.com/canonical/lxd/pull/12381

This doesn't leak go routines and returns within 1s of the parent process exiting if there are silent background tasks running.

mihalicyn commented 11 months ago

Great investigation, Tom!

tomponline commented 11 months ago

@mihalicyn do you see any problem with keeping poll()ing every 1s?

mihalicyn commented 11 months ago

@mihalicyn do you see any problem with keeping poll()ing every 1s?

No, I don't. I'll take a look closer on your PR.