Jahfry / Miscellaneous

Notes and stuff I'm posting publicly

a problem: virtual machine is stopped from inside #2

Closed luby2999 closed 10 months ago

luby2999 commented 2 years ago

Hello Jahfry: Your code is great, thanks. Stopping the virtual machine via the web UI button works without problems. But when the virtual machine is shut down from inside the VM, the code does not work. It seems that the terminating process cannot call 'driverctl'. Is there a way to solve this? Thanks again.

Jahfry commented 2 years ago

Hmm. Since I always stop via either Proxmox UI or qm stop I hadn't run into this. I'll do some research on it and get back to you. Might be a couple of days before I can. Thanks for the report.

Jahfry commented 2 years ago

I'm not sure what is going on for you.

I created a new Windows VM without any virtio drivers (except for storage) and without the QEMU guest agent (those are good things to install, I just wanted a worst case).

When I do the following:

The 'post-stop' phase of the hookscript isn't shown in the Proxmox 'Tasks' UI. Its output goes to syslog instead, so you won't see it under 'Tasks' when it is invoked (all other phases do show up in 'Tasks').

Can you:

For example on my system it looks like this:

May 7 09:37:50 proxmox QEMU[3533285]: kvm: terminating on signal 15 from pid 2746 (/usr/sbin/qmeventd)
May 7 09:37:51 proxmox qmeventd[3547183]: Starting cleanup for 100
May 7 09:37:51 proxmox qmeventd[3547183]: VM 100 GUEST HOOK (/var/lib/vz/snippets/hookscript-driverctl.pl):
May 7 09:37:51 proxmox qmeventd[3547183]: VM 100 config file: /etc/pve/qemu-server/100.conf
May 7 09:37:52 proxmox qmeventd[3547183]: VM 100 'post-stop'. VM stopped, doing cleanup.
May 7 09:37:53 proxmox qmeventd[3547183]: /usr/sbin/driverctl --nosave unset-override 0000:0b:00.0: Success
May 7 09:37:53 proxmox qmeventd[3547183]: /usr/sbin/driverctl --nosave unset-override 0000:0b:00.1: Success
May 7 09:37:54 proxmox qmeventd[3547183]: /usr/sbin/driverctl --nosave unset-override 0000:0c:00.0: Success
May 7 09:37:54 proxmox qmeventd[3547183]: /usr/sbin/driverctl --nosave unset-override 0000:0c:00.1: Success
May 7 09:37:54 proxmox qmeventd[3547183]: driverctl overrides active: None (all PCI devices available for passthrough)
May 7 09:37:54 proxmox qmeventd[3547183]: Finished cleanup for 100

Jahfry commented 2 years ago

Also, please tell me the type & model of device you are passing through. I assume it is a GPU. I'm wondering if it's a GPU that might be having problems with the reset bug. Mine don't have that as an issue so I can't do much to test it.

luby2999 commented 2 years ago

Hello Jahfry: Thank you for your reply. My CPU is an E3 1230 v5, the GPU is a GTX 670, the system is PVE 7.2.3, and the guest is macOS Monterey. I followed your steps. Here is what `grep qmeventd /var/log/syslog` shows (shutdown from the UI button):

May  8 10:44:35 pve QEMU[1525]: kvm: terminating on signal 15 from pid 898 (/usr/sbin/qmeventd)
May  8 10:44:36 pve qmeventd[2236]: Starting cleanup for 101
May  8 10:44:36 pve qmeventd[2236]: trying to acquire lock...
May  8 10:44:37 pve qmeventd[2236]:  OK
May  8 10:44:37 pve qmeventd[2236]: VM 101 GUEST HOOK (/var/lib/vz/snippets/hookscript-driverctl.pl):
May  8 10:44:37 pve qmeventd[2236]: VM 101 config file: /etc/pve/qemu-server/101.conf
May  8 10:44:38 pve qmeventd[2236]: VM 101 'post-stop'. VM stopped, doing cleanup.
May  8 10:44:39 pve qmeventd[2236]: `/usr/sbin/driverctl --nosave unset-override 0000:01:00.0`: Success
May  8 10:44:39 pve qmeventd[2236]: `/usr/sbin/driverctl --nosave unset-override 0000:01:00.1`: Success
May  8 10:44:40 pve qmeventd[2236]: `/usr/sbin/driverctl --nosave unset-override 0000:01:00.0`: Success
May  8 10:44:40 pve qmeventd[2236]: `/usr/sbin/driverctl --nosave unset-override 0000:01:00.1`: Success
May  8 10:44:40 pve qmeventd[2236]: `driverctl` overrides active: None (all PCI devices available for passthrough)
May  8 10:44:40 pve qmeventd[2236]: Finished cleanup for 101

Here is what `grep qmeventd /var/log/syslog` shows (shutdown from inside the VM):

May  8 10:47:42 pve QEMU[2859]: kvm: terminating on signal 15 from pid 898 (/usr/sbin/qmeventd)
May  8 10:47:43 pve qmeventd[3103]: Starting cleanup for 101
May  8 10:47:43 pve qmeventd[3103]: vm still running

Jahfry commented 2 years ago

Any chance you're running on Proxmox VE 7.2? I haven't used it yet as I've seen some issues around it and I don't have a pressing need for it.

If your VM is fully installed and configured, I'm leaning towards this being a problem with the VM not shutting down properly. I notice it's a macOS VM, so there may be problems around that. I don't know much about macOS, but if it does something along the lines of what Windows does with 'Fast Startup', it may never really want to fully shut down.

I'm not a Proxmox expert and I think this is going to end up needing you to ask elsewhere to figure out why your VM isn't executing the post-stop phase with a hookscript.

For debugging, I would use a generic hookscript like:

/usr/share/pve-docs/examples/guest-example-hookscript.pl

(a copy of this file should be on your Proxmox server)

Get that working to the point where you see the 'post-stop' phase appear in syslog when shutting down the VM from inside. Once you've got that, I think my script will work for you. I.e., since we're never seeing 'post-stop' when you shut down from inside the VM, I don't think it's even getting to my script, which means I can't fix the problem.
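For anyone repeating this check, here is a minimal Python sketch that classifies a syslog excerpt from one shutdown attempt. The function name `classify_cleanup` and its return labels are made up for illustration; the patterns are copied from the qmeventd lines quoted earlier in this thread and may differ on other Proxmox versions.

```python
import re

# Patterns taken from the syslog excerpts quoted in this thread.
POST_STOP = re.compile(r"qmeventd\[\d+\]: VM (\d+) 'post-stop'")
CLEANUP_START = re.compile(r"qmeventd\[\d+\]: Starting cleanup for (\d+)")
STILL_RUNNING = re.compile(r"qmeventd\[\d+\]: vm still running")


def classify_cleanup(log_text, vmid):
    """Hypothetical helper: summarise what qmeventd logged for one VM.

    Returns "post-stop" if the hookscript's post-stop message appears,
    "still-running" if cleanup for this VM started but qmeventd then
    reported the VM as still running, and "no-cleanup" otherwise.
    Intended for an excerpt covering a single shutdown attempt.
    """
    cleanup_started = False
    for line in log_text.splitlines():
        match = POST_STOP.search(line)
        if match and int(match.group(1)) == vmid:
            return "post-stop"
        match = CLEANUP_START.search(line)
        if match:
            # "vm still running" carries no VM id, so attribute it to
            # the most recent "Starting cleanup for <id>" line.
            cleanup_started = int(match.group(1)) == vmid
        elif cleanup_started and STILL_RUNNING.search(line):
            return "still-running"
    return "no-cleanup"
```

Feeding it the output of `grep qmeventd /var/log/syslog` for the time of the shutdown should tell the two cases in this thread apart: the UI shutdown reaches 'post-stop', the in-guest shutdown stops at "vm still running".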

Leaving issue open. If you do find a solution, please drop a note as to what you found. Likewise, if I find a way to replicate and fix it, I'll let you know. Eventually I'll probably look at installing a macOS VM for fun, but it's low on my list.

Jahfry commented 2 years ago

Just to clarify, the reason I recommend continuing to look for a fix but keeping a hook script attached is I don't think you'll see the 'post-stop' syslog message unless you have a script that sends that. It isn't a part of normal operation without a hook script.

The reason I recommend switching to the generic one is that way you'll be able to continue using the existing VM without needing to manually do what the hook script would be doing during 'post-stop'.

luby2999 commented 2 years ago

Sincerely, thank you for your guidance. I will test it again and get back to you.

luby2999 commented 2 years ago

There is a similar situation: https://forum.proxmox.com/threads/hookscript-with-post-stop-when-the-vm-was-shutdown-from-the-vm-itself.72802/#post-330491

Jahfry commented 2 years ago

Based on that thread ...

if you have a USB device passed through, remove it from the VM hardware temporarily and see if that allows the hookscript to invoke 'post-stop'.

That won't be a full resolution, but would at least let you know if that is the cause.
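To see which passthrough entries a VM currently has before removing one, a small Python sketch like the following could list them from the config text. It assumes the usual Proxmox config keys `usbN` and `hostpciN` and that snapshot sections begin with a `[name]` line; the helper name `passthrough_entries` is hypothetical.

```python
import re

# Assumed key format: "usb0: ..." / "hostpci0: ..." as in a Proxmox VM config.
PASSTHROUGH_KEY = re.compile(r"^(usb\d+|hostpci\d+):\s*(.+)$")


def passthrough_entries(config_text):
    """Hypothetical helper: map passthrough keys to their values.

    Only the current configuration is inspected; parsing stops at the
    first "[...]" line, which is assumed to start a snapshot section.
    """
    entries = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            break
        match = PASSTHROUGH_KEY.match(line)
        if match:
            entries[match.group(1)] = match.group(2)
    return entries
```

Reading the file shown in the logs (for example `/etc/pve/qemu-server/101.conf`) and passing its contents in would show whether any `usbN` line is present for the test.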

It seems like a problem in general somewhere in either Proxmox or qemu. Not something I'll be able to directly address. If I figure out a good workaround I'll put it in but first I'll need to find a way to replicate the problem. I'll see if I can force the problem to occur using USB passthrough.

In the meantime, if you can get it working without the USB passed through, you've got enough detail to file a bug. I looked and didn't find one that seems to cover this. I'd add a link to the thread you found for background information.

luby2999 commented 2 years ago

When I add both PCI device passthrough and USB, the VM can't boot anymore:

VM 101 GUEST HOOK (/var/lib/vz/snippets/hookscript-driverctl.pl):
VM 101 config file: /etc/pve/qemu-server/101.conf
VM 101 'pre-start' ... preparing to start.
`/usr/sbin/driverctl --nosave set-override 0000:01:00.0 vfio-pci`: Success.
`/usr/sbin/driverctl --nosave set-override 0000:01:00.1 vfio-pci`: Success.
Usage: driverctl [OPTIONS...] {COMMAND}...

Inspect or control default device driver bindings.

Supported commands:
  set-override <device> <driver>    Make <driver> the default driver
                                    for <device>
  unset-override <device>           Remove any override for <device>
  load-override <device>            Load an override previously specified
                                    for <device>
  list-devices                      List all overridable devices
  list-overrides                    List all currently specified overrides

Supported options:
 -h --help             Show this help
 -v --verbose --debug  Show verbose debug information
 -b --bus <bus>        Work on bus <bus> (default pci)
    --noprobe          Do not reprobe when setting, unsetting, or
                       loading an override
    --nosave           Do not save changes when setting or unsetting
                       an override

sh: 2: vfio-pci: not found
--Unable to start VM 101--
`/usr/sbin/driverctl --nosave set-override 0000:00:14.0
 vfio-pci`: Failed (exit code 32512). at /var/lib/vz/snippets/hookscript-driverctl.pl line 59, <_CF> line 96.
TASK ERROR: hookscript error for 101 on pre-start: command '/var/lib/vz/snippets/hookscript-driverctl.pl 101 pre-start' failed: exit code 127

x130844 commented 2 years ago

I have the same issue: I have to manually run the .pl script with `vmid post-stop` if the VM (Windows 10) was shut down from inside itself (the Shutdown option in Windows). PVE 7.2.x
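The manual workaround described here could be wrapped in a small Python sketch. The hookscript path and the `<script> <vmid> <phase>` calling form are taken from the logs earlier in this thread. The exact `status: stopped` text from `qm status` is an assumption, and the function names are hypothetical.

```python
import subprocess

# Path as it appears in the logs in this thread; adjust for your setup.
HOOKSCRIPT = "/var/lib/vz/snippets/hookscript-driverctl.pl"


def is_stopped(qm_status_output):
    """True if the text looks like `qm status <vmid>` reporting a stopped VM.

    Assumes the output is the single line "status: stopped".
    """
    return qm_status_output.strip() == "status: stopped"


def post_stop_command(vmid, hookscript=HOOKSCRIPT):
    """Build the argument list for a manual post-stop run of the hookscript."""
    return [hookscript, str(vmid), "post-stop"]


def run_post_stop_if_stopped(vmid):
    """Run the post-stop phase by hand, but only if the VM is stopped.

    Returns True if the hookscript was invoked, False if the VM was
    not reported as stopped. Must run on the Proxmox host.
    """
    status = subprocess.run(
        ["qm", "status", str(vmid)],
        capture_output=True, text=True, check=True,
    ).stdout
    if not is_stopped(status):
        return False
    subprocess.run(post_stop_command(vmid), check=True)
    return True
```

This only automates the manual step; it does not explain why qmeventd skips 'post-stop' after an in-guest shutdown.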

Jahfry commented 2 years ago

I'm slowly getting my machines running in a new location and may be able to look further at this "soon".

Jahfry commented 2 years ago

> I have the same issue, I have to manually run the .pl vmid post-stop if the VM (windows 10) was stopped from itself. (Shutdown option in windows). PVE 7.2.x

I've set up a new Proxmox VE 7.2-17 this week and set up a fresh Win10 VM for testing this.

Unfortunately this still isn't replicating on my system. The hookscript is executing post-stop regardless of the VM being shut down from the Proxmox UI, from within the Windows power menu, or from the host via qm shutdown 100.

If you're able to find a likely error I'll be happy to look at it again. But I don't really know where to start.

Note: I have a very detailed log of how I have my system set up to this point in these pages ... you might look to see if there is some package or option I'm adding that you don't have. But I don't do a significant amount of Proxmox changes.

Jahfry commented 2 years ago

> When I add pci device passthrough and usb, it can't boot anymore
> {snip}
> sh: 2: vfio-pci: not found
> --Unable to start VM 101--
> /usr/sbin/driverctl --nosave set-override 0000:00:14.0 vfio-pci: Failed (exit code 32512). at /var/lib/vz/snippets/hookscript-driverctl.pl line 59, <_CF> line 96.
> TASK ERROR: hookscript error for 101 on pre-start: command '/var/lib/vz/snippets/hookscript-driverctl.pl 101 pre-start' failed: exit code 127

@luby2999

Ew. That looks like the system is trying to exec vfio-pci as a command, rather than it being an argument on the full command (/usr/sbin/driverctl --nosave set-override 0000:00:14.0 vfio-pci).

That seems like maybe you copy/pasted in extra line breaks.

If still interested in it, try grabbing it from here again and using the "Copy Raw Contents" button to get the buffer (double square icon on the right).

If somehow that causes the same error, please open a different ticket as it isn't related to this particular issue.
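As background on the `vfio-pci: not found` error above: the two exit codes in that log appear to describe the same failure. Assuming the script reports a raw wait status, 32512 is 127 shifted left by eight bits, and 127 is the shell's "command not found" code. The Python sketch below illustrates that arithmetic and one way a stray line break in a device address could be stripped before the command is built. The helper names are hypothetical and this is not the hookscript's actual code; only the `driverctl --nosave set-override <device> <driver>` form comes from the usage text quoted in this thread.

```python
import re

# domain:bus:device.function, e.g. 0000:00:14.0
PCI_ADDRESS = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]")


def exit_code_from_wait_status(status):
    """Extract the exit code from a raw wait status (high byte)."""
    return status >> 8


def clean_pci_address(raw):
    """Strip surrounding whitespace (including line breaks) and validate."""
    address = raw.strip()
    if not PCI_ADDRESS.fullmatch(address):
        raise ValueError(f"not a PCI address: {raw!r}")
    return address


def set_override_command(raw_address, driver="vfio-pci"):
    """Build the driverctl call as an argument list.

    Passing a list (rather than one shell string) means an embedded
    newline cannot split the command into two shell lines.
    """
    return [
        "/usr/sbin/driverctl", "--nosave", "set-override",
        clean_pci_address(raw_address), driver,
    ]
```

With the address from the failing log line, `set_override_command("0000:00:14.0\n")` yields a single five-element command with the line break removed.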