canonical / multipass

Multipass orchestrates virtual Ubuntu instances
https://multipass.run
GNU General Public License v3.0

Multipass can't stop QEMU instances on a 10.15 macOS VM #2784

Open ricab opened 1 year ago

ricab commented 1 year ago

Describe the bug

The stop command appears to hang when trying to stop a QEMU instance. After it returns, the instance is still seen as running by Multipass. The instance goes through shutdown, but the QEMU process does not quit. In addition, multipass list hangs for a while, as multipass tries to get extra IPs from the instance via SSH.

I can reproduce this reliably in this setting. It is unclear whether it is specific to 10.15, nesting, or something else.

To Reproduce

  1. multipass start
  2. multipass stop

Expected behavior

The instance would stop and the process would quit. Failing that, the multipass daemon would log an error or warning message saying that the process failed to quit.
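A quick way to confirm on the host that the QEMU process survives the stop (a sanity check added here for illustration, using the instance name from the session below):

$ multipass stop apparent-goldcrest
$ pgrep -fl qemu-system

If the process is still listed after stop returns, the bug has reproduced.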

Example session

$ mp launch focal
Launched: apparent-goldcrest
$ mp ls
Name                    State             IPv4             Image
apparent-goldcrest      Running           192.168.64.17    Ubuntu 20.04 LTS
bbb                     Stopped           --               Ubuntu 18.04 LTS
$ mp stop --all
$ mp ls
Name                    State             IPv4             Image
apparent-goldcrest      Running           192.168.64.17    Ubuntu 20.04 LTS
bbb                     Stopped           --               Ubuntu 18.04 LTS
$ mp ls
Name                    State             IPv4             Image
apparent-goldcrest      Running           192.168.64.17    Ubuntu 20.04 LTS
bbb                     Stopped           --               Ubuntu 18.04 LTS
$ time mp ls
Name                    State             IPv4             Image
apparent-goldcrest      Running           192.168.64.17    Ubuntu 20.04 LTS
bbb                     Stopped           --               Ubuntu 18.04 LTS

real    0m20,179s
user    0m0,037s
sys     0m0,031s
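While the listing hangs, the daemon log shows what it is blocked on; a minimal way to follow it on macOS (the log path is the one quoted later in this thread):

$ tail -f /Library/Logs/Multipass/multipassd.log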

Logs

From the end of the launch:

[2022-10-14T15:06:19.769] [debug] [ssh session] Executing '[ -e /var/lib/cloud/instance/boot-finished ]'
[2022-10-14T15:06:21.295] [debug] [ssh session] Executing '[ -e /var/lib/cloud/instance/boot-finished ]'
[2022-10-14T15:06:23.243] [debug] [ssh session] Executing '[ -e /var/lib/cloud/instance/boot-finished ]'
[2022-10-14T15:06:25.402] [debug] [ssh session] Executing '[ -e /var/lib/cloud/instance/boot-finished ]'
[2022-10-14T15:06:25.667] [debug] [daemon] Returning setting local.privileged-mounts=true
[2022-10-14T15:06:27.013] [debug] [apparent-goldcrest] QMP: {"timestamp": {"seconds": 1665785187, "microseconds": 13546}, "event": "RTC_CHANGE", "data": {"offset": 395728}}

[2022-10-14T15:17:51.026] [debug] [apparent-goldcrest] QMP: {"timestamp": {"seconds": 1665785871, "microseconds": 23908}, "event": "RTC_CHANGE", "data": {"offset": 395728}}

[2022-10-14T15:25:37.357] [debug] [utils] executing 'ip -brief -family inet address show scope global'
[2022-10-14T15:25:37.358] [debug] [ssh session] Executing 'ip -brief -family inet address show scope global'
[2022-10-14T15:25:40.486] [debug] [ssh process] /Users/cibot/multipass-private/src/ssh/ssh_process.cpp:118 read_stream(type = 0, timeout = -1):
[2022-10-14T15:25:40.486] [debug] [ssh process] /Users/cibot/multipass-private/src/ssh/ssh_process.cpp:136 read_stream(): num_bytes = 50
[2022-10-14T15:25:40.489] [debug] [ssh process] /Users/cibot/multipass-private/src/ssh/ssh_process.cpp:136 read_stream(): num_bytes = 0
[2022-10-14T15:25:48.047] [debug] [ssh session] Executing 'wall The system is going down for poweroff now'
[2022-10-14T15:25:48.846] [info] [sshfs-mounts] No mounts to stop for instance "apparent-goldcrest"
[2022-10-14T15:25:48.855] [debug] [apparent-goldcrest] QMP: {"return": {}}

[2022-10-14T15:25:48.855] [debug] [apparent-goldcrest] QMP: {"timestamp": {"seconds": 1665786348, "microseconds": 855393}, "event": "POWERDOWN"}

[2022-10-14T15:25:48.856] [info] [apparent-goldcrest] VM powering down
[2022-10-14T15:25:53.016] [debug] [apparent-goldcrest] QMP: {"timestamp": {"seconds": 1665786353, "microseconds": 15552}, "event": "SHUTDOWN", "data": {"guest": true, "reason": "guest-shutdown"}}

[2022-10-14T15:25:53.017] [info] [apparent-goldcrest] VM shut down
[2022-10-14T15:26:18.857] [debug] [daemon] instance "bbb" does not need stopping
[2022-10-14T15:26:53.867] [debug] [base_vm] Error getting extra IP addresses: ssh connection failed: 'Timeout connecting to 192.168.64.17'
[2022-10-14T15:28:10.269] [debug] [base_vm] Error getting extra IP addresses: ssh connection failed: 'Timeout connecting to 192.168.64.17'
[2022-10-14T15:29:40.909] [debug] [base_vm] Error getting extra IP addresses: ssh connection failed: 'Timeout connecting to 192.168.64.17'


townsend2010 commented 11 months ago

Hey @ricab,

You say a 10.15 macOS VM, so you mean like one we have on our Mini and not directly on the Mac host itself? Also, I wonder if this can just be closed since we are dropping 10.15 support in the upcoming 1.13 release???

ricab commented 11 months ago

Hmm, I believe this happened when I was working on that dedicated testflinger VM. I don't know whether the macOS version makes a difference. There could have been some sort of race that manifested because things were slower. Idk...

wolfch-elsevier commented 9 months ago

I was suffering from this issue on macOS 13.5 (Ventura), running Multipass 1.12.2+mac. TL;DR: if your VM is out of disk space, that could be why.
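A quick way to check for this condition from the host, while the instance is still reachable (a sketch using multipass exec, assuming the instance is named primary as below):

$ multipass exec primary -- df -h /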

==> /Library/Logs/Multipass/multipassd.log <==
[2024-01-12T06:03:30.648] [debug] [primary] QMP: {"return": {}}

[2024-01-12T06:03:30.650] [debug] [primary] QMP: {"timestamp": {"seconds": 1705057410, "microseconds": 647553}, "event": "POWERDOWN"}

[2024-01-12T06:03:30.650] [info] [primary] VM powering down
[2024-01-12T06:09:50.650] [debug] [base_vm] Error getting extra IP addresses: ssh connection failed: 'Timeout connecting to 192.168.64.6'
[2024-01-12T06:10:00.903] [info] [daemon] Cannot open ssh session on "primary" shutdown: ssh connection failed: 'Failed to connect: Host is down'
[2024-01-12T06:10:00.904] [debug] [primary] QMP: {"return": {}}
{"timestamp": {"seconds": 1705057800, "microseconds": 903987}, "event": "POWERDOWN"}

...then I tried to use the UI icon menu to exit multipass, but that only seems to exit the UI - I still have:

$ multipass list
Name                    State             IPv4             Image
primary                 Running           192.168.64.6     Ubuntu 20.04 LTS
ubuntu-2004             Stopped           --               Ubuntu 20.04 LTS
ubuntu-lts              Stopped           --               Ubuntu 20.04 LTS

Then I tried to shutdown the instance from within:

$ multipass shell primary
ubuntu@primary:~$ sudo -sH
root@primary:/home/ubuntu# sync;halt
(back out to the host shell)
$ multipass list
Name                    State             IPv4             Image
primary                 Running           192.168.64.6     Ubuntu 20.04 LTS
ubuntu-2004             Stopped           --               Ubuntu 20.04 LTS
ubuntu-lts              Stopped           --               Ubuntu 20.04 LTS

Still running! ...and then tried to get back in:

$ multipass shell primary
shell failed: ssh connection failed: 'Connection refused'

So halt just killed the instance's sshd without fully shutting down the instance.

So then I tried to stop/start the multipass daemon in a separate terminal:

sudo launchctl stop com.canonical.multipassd

...this caused the hanging multipass stop primary to output:

$ multipass stop primary
stop failed: cannot connect to the multipass socket

...then I restarted the service daemon:

sudo launchctl start com.canonical.multipassd
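One way to verify the daemon actually relaunched (an assumption, not from the original comment; launchctl list shows the loaded jobs):

$ sudo launchctl list | grep com.canonical.multipassd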

Then I checked the state:

$ multipass list
Name                    State             IPv4             Image
primary                 Running           192.168.64.6     Ubuntu 20.04 LTS
ubuntu-2004             Stopped           --               Ubuntu 20.04 LTS
ubuntu-lts              Stopped           --               Ubuntu 20.04 LTS

Then I tried rebooting my Mac laptop - but that instance IS STILL RUNNING!! See: mp_reboot_log.txt

Finally I tried to see if I could find the VM's process and kill it:

# ps -ef | grep -i qemu
    0   478   323   0  6:35AM ??         0:13.86 /Library/Application Support/com.canonical.multipass/bin/qemu-system-x86_64 -accel hvf -drive file=/Library/Application Support/com.canonical.multipass/bin/../Resources/qemu/edk2-x86_64-code.fd,if=pflash,format=raw,readonly=on -cpu host -nic vmnet-shared,model=virtio-net-pci,mac=52:54:00:1a:0c:d1 -device virtio-scsi-pci,id=scsi0 -drive file=/var/root/Library/Application Support/multipassd/qemu/vault/instances/primary/ubuntu-20.04-server-cloudimg-amd64.img,if=none,format=qcow2,discard=unmap,id=hda -device scsi-hd,drive=hda,bus=scsi0.0 -smp 1 -m 1024M -qmp stdio -chardev null,id=char0 -serial chardev:char0 -nographic -cdrom /var/root/Library/Application Support/multipassd/qemu/vault/instances/primary/cloud-init-config.iso -loadvm suspend -machine pc-i440fx-8.0
    0  3264  2421   0  7:16AM ttys000    0:00.00 grep -i qemu
# kill -9 478
$ multipass list
Name                    State             IPv4             Image
primary                 Stopped           --               Ubuntu 20.04 LTS
ubuntu-2004             Stopped           --               Ubuntu 20.04 LTS
ubuntu-lts              Stopped           --               Ubuntu 20.04 LTS

Ok, killing the process stopped the VM. However, when I attempted to restart the VM, it failed on the first attempt but succeeded on the second:

$ multipass shell primary
shell failed: [ssh client] channel creation failed: ''
$ multipass shell primary
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-121-generic x86_64)
[...]
Last login: Fri Jan 12 06:27:00 2024 from 192.168.64.1
/home/ubuntu/.pyenv/libexec/pyenv-init: line 165: cannot create temp file for here-document: No space left on device

I think the out-of-space issue may have contributed to the hanging in the first place.

And it hung again when running multipass stop primary, so I had to kill the process again. Next, I attempted to resize the virtual disk, as suggested in #62:

$ multipass set local.primary.disk=20G

/Library/Logs/Multipass/multipassd.log :

[...]
Rebuilding refcount structure
Repairing cluster 2916 refcount=1 reference=0
Repairing cluster 2917 refcount=1 reference=0
Repairing cluster 32904 refcount=1 reference=0
Repairing cluster 66009 refcount=1 reference=0

[2024-01-12T07:33:36.940] [debug] [daemon] Succeeded setting local.primary.disk=20G

Then I had to go back in and grow the partition to the new size, per https://manpages.ubuntu.com/manpages/xenial/en/man1/growpart.1.html. Now when I do multipass info primary I see:

Disk usage: 4.6GiB out of 19.3GiB

...and inside the VM:

/dev/sda1        20G  4.6G   15G  24% /
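For reference, the in-VM steps were roughly as follows (a sketch assuming an ext4 root filesystem on /dev/sda1, as the df output above suggests):

$ multipass shell primary
ubuntu@primary:~$ sudo growpart /dev/sda 1
ubuntu@primary:~$ sudo resize2fs /dev/sda1
ubuntu@primary:~$ df -h /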

Now I'm able to start/stop the VM normally.

omar-bizreh commented 8 months ago

I was facing the same issue, with only one instance, but on Ubuntu:

Solution: I went to this file /var/snap/multipass/common/data/multipassd/multipassd-vm-instances.json (hint was from this comment) and noticed that state was set to 4. I edited it to 0, restarted the multipass snap, and the VM can now boot without issues.
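Roughly, the workaround as described (a sketch; stopping the snap before editing is an assumption, to keep the daemon from rewriting the file):

$ sudo snap stop multipass
$ sudoedit /var/snap/multipass/common/data/multipassd/multipassd-vm-instances.json
    # change the instance's "state": 4 to "state": 0
$ sudo snap start multipass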

I'm not familiar with the internals of multipass, but it seems to act based on the contents of that file, and for some reason the state wasn't updated when I last stopped multipass.

sed-i commented 6 months ago

I experience something similar on jammy/amd64: when one instance is non-responsive (high load for a long time), multipass list hangs, and multipass stop doesn't stop the instance.