canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

LXD/LXC fails to release GPU on VM stop #12128

Open itzsimpl opened 1 year ago

itzsimpl commented 1 year ago

Required information

Click to see full * Distribution: Ubuntu * Distribution version: 22.04 * The output of * `lxc-start --version` ``` 5.0.0~git2209-g5a7b9ce67 ``` * `lxc-checkconfig` ``` LXC version 5.0.0~git2209-g5a7b9ce67 Kernel configuration not found at /proc/config.gz; searching... Kernel configuration found at /boot/config-6.2.0-26-generic --- Namespaces --- Namespaces: enabled Utsname namespace: enabled Ipc namespace: enabled Pid namespace: enabled User namespace: enabled Network namespace: enabled --- Control groups --- Cgroups: enabled Cgroup namespace: enabled Cgroup v1 mount points: Cgroup v2 mount points: /sys/fs/cgroup Cgroup v1 systemd controller: missing Cgroup v1 freezer controller: missing Cgroup ns_cgroup: required Cgroup device: enabled Cgroup sched: enabled Cgroup cpu account: enabled Cgroup memory controller: enabled Cgroup cpuset: enabled --- Misc --- Veth pair device: enabled, not loaded Macvlan: enabled, not loaded Vlan: enabled, not loaded Bridges: enabled, loaded Advanced netfilter: enabled, loaded CONFIG_IP_NF_TARGET_MASQUERADE: enabled, not loaded CONFIG_IP6_NF_TARGET_MASQUERADE: enabled, not loaded CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled, not loaded CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled, not loaded FUSE (for use with lxcfs): enabled, not loaded --- Checkpoint/Restore --- checkpoint restore: enabled CONFIG_FHANDLE: enabled CONFIG_EVENTFD: enabled CONFIG_EPOLL: enabled CONFIG_UNIX_DIAG: enabled CONFIG_INET_DIAG: enabled CONFIG_PACKET_DIAG: enabled CONFIG_NETLINK_DIAG: enabled File capabilities: Note : Before booting a new kernel, you can check its configuration usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig ``` * `uname -a` ``` Linux q1 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux ``` * `cat /proc/self/cgroup` ``` 0::/user.slice/user-1000.slice/session-3.scope ``` * `cat /proc/1/mounts` ``` sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0 proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0 udev /dev devtmpfs rw,nosuid,relatime,size=263874368k,nr_inodes=65968592,mode=755,inode64 0 0 devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0 tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime,size=52797032k,mode=755,inode64 0 0 efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0 /dev/nvme0n1p2 / ext4 rw,relatime,stripe=32 0 0 securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0 tmpfs /dev/shm tmpfs rw,nosuid,nodev,inode64 0 0 tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,inode64 0 0 cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0 pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0 bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0 systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=92734 0 0 hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0 mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0 debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0 tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0 fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0 configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0 ramfs /run/credentials/systemd-sysusers.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0 /dev/nvme0n1p1 /boot/efi vfat 
rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0 /dev/loop0 /snap/core20/1852 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0 /dev/loop1 /snap/core20/1974 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0 /dev/loop2 /snap/core22/858 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0 /dev/loop4 /snap/snapd/18596 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0 /dev/loop3 /snap/lxd/25112 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0 /dev/loop5 /snap/snapd/19457 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0 binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0 tmpfs /run/snapd/ns tmpfs rw,nosuid,nodev,noexec,relatime,size=52797032k,mode=755,inode64 0 0 nsfs /run/snapd/ns/lxd.mnt nsfs rw 0 0 tmpfs /var/snap/lxd/common/ns tmpfs rw,relatime,size=1024k,mode=700,inode64 0 0 nsfs /var/snap/lxd/common/ns/shmounts nsfs rw 0 0 nsfs /var/snap/lxd/common/ns/mntns nsfs rw 0 0 tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=52797028k,nr_inodes=13199257,mode=700,uid=1000,gid=1000,inode64 0 0 lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0 ```

Issue description

I have an Ubuntu 22.04 system with multiple GPUs, the NVIDIA 535-server drivers installed, and persistence mode off. I pass individual GPUs to a VM via PCI passthrough. When I pass a single GPU to the VM, start it, and then stop it, the GPU is not returned to the host (i.e. nvidia-smi no longer shows it). When I pass multiple GPUs to the VM, start it, and then stop it, the GPU with the lowest PCI address on the host is not returned (nvidia-smi no longer shows it), but the other GPUs are returned just fine.

When I restart the VM, the GPUs are visible inside it again, but if I start a container with NVIDIA driver passthrough, only the GPUs currently visible on the host (i.e. all installed GPUs minus those not returned from VMs earlier) are visible in the container. The only relevant information I can find is a "Failed to stop device" error in syslog.
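
For reference, the container check above corresponds to roughly the following (a minimal sketch; the container name and image are just examples):

```
# Container using the host NVIDIA driver; it only lists GPUs that
# nvidia-smi on the host can still see.
lxc launch ubuntu:22.04 c1 -c nvidia.runtime=true
lxc config device add c1 gpus gpu
lxc exec c1 -- nvidia-smi -L
```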

Steps to reproduce

  1. run `nvidia-smi -L` on the host
  2. create a VM with a single GPU via PCI passthrough (see the command sketch after this list)
  3. start the VM
  4. stop the VM
  5. run `nvidia-smi -L` on the host (the GPU that was passed through to the VM is no longer listed)
  6. create a VM with multiple GPUs via PCI passthrough
  7. start the VM
  8. stop the VM
  9. run `nvidia-smi -L` on the host (the passed-through GPU with the lowest PCI address on the host is now also missing)
  10. run a container with NVIDIA driver passthrough (it sees the same set of GPUs as the host)
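
For completeness, the single-GPU case boils down to roughly these commands (a sketch; the VM name, image and PCI address are examples from my setup):

```
nvidia-smi -L    # GPU 0000:ca:00.0 is listed on the host
lxc init ubuntu:22.04 vm1 --vm
lxc config device add vm1 gpu0 gpu gputype=physical pci=0000:ca:00.0
lxc start vm1
lxc stop vm1
nvidia-smi -L    # GPU 0000:ca:00.0 is no longer listed on the host
```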

Information to attach

Click to see full - VM log (`lxc info --show-log vm2`) ``` Name: vm2 Status: STOPPED Type: virtual-machine Architecture: x86_64 Created: 2023/08/07 22:14 UTC Last Used: 2023/08/07 23:19 UTC Log: qemu-system-x86_64: Issue while setting TUNSETSTEERINGEBPF: Invalid argument with fd: 83, prog_fd: -1 ``` - any relevant kernel output (`syslog`), the single GPU case ``` Aug 7 23:18:53 q1 kernel: [ 846.086912] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none Aug 7 23:18:53 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data Aug 7 23:18:54 q1 kernel: [ 846.548326] xhci_hcd 0000:ca:00.2: remove, state 4 Aug 7 23:18:54 q1 kernel: [ 846.548343] usb usb10: USB disconnect, device number 1 Aug 7 23:18:54 q1 kernel: [ 846.549060] xhci_hcd 0000:ca:00.2: USB bus 10 deregistered Aug 7 23:18:54 q1 kernel: [ 846.549083] xhci_hcd 0000:ca:00.2: remove, state 4 Aug 7 23:18:54 q1 kernel: [ 846.549091] usb usb9: USB disconnect, device number 1 Aug 7 23:18:54 q1 kernel: [ 846.550896] xhci_hcd 0000:ca:00.2: USB bus 9 deregistered Aug 7 23:18:54 q1 kernel: [ 846.653021] kauditd_printk_skb: 9 callbacks suppressed Aug 7 23:18:54 q1 kernel: [ 846.653026] audit: type=1400 audit(1691450334.129:54): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-vm2_" pid=5316 comm="apparmor_parser" Aug 7 23:18:53 q1 snapd[2334]: message repeated 3 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data] Aug 7 23:18:55 q1 systemd[3823]: Started snap.lxd.lxc.b9b13195-c7c3-46d4-842a-856565db2c99.scope. Aug 7 23:19:13 q1 kernel: [ 865.800363] vfio-pci 0000:ca:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258 Aug 7 23:19:13 q1 kernel: [ 865.800386] vfio-pci 0000:ca:00.0: vfio_ecap_init: hiding ecap 0x19@0x900 Aug 7 23:19:46 q1 systemd[3823]: Started snap.lxd.lxc.0a424bc8-95d2-4cb9-bdd0-468d3dbce737.scope. Aug 7 23:19:51 q1 systemd[3823]: Started snap.lxd.lxc.63564057-7dd7-462c-9548-3a5153ddd1e7.scope. Aug 7 23:19:51 q1 systemd[1]: Starting Cleanup of Temporary Directories... Aug 7 23:19:51 q1 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully. Aug 7 23:19:51 q1 systemd[1]: Finished Cleanup of Temporary Directories. Aug 7 23:19:54 q1 kernel: [ 907.246377] vfio-pci 0000:ca:00.0: Relaying device request to user (#0) Aug 7 23:20:01 q1 kernel: [ 913.710624] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:20:01 q1 kernel: [ 913.711376] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:20:01 q1 lxd.daemon[3076]: time="2023-08-07T23:20:01Z" level=error msg="Failed to stop device" device=gpu3 err="Failed probing device \"0000:ca:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=vm2 instanceType=virtual-machine project=default Aug 7 23:20:01 q1 systemd-networkd[2222]: mac6293c2ac: Link DOWN Aug 7 23:20:01 q1 systemd-networkd[2222]: mac6293c2ac: Lost carrier Aug 7 23:20:01 q1 kernel: [ 913.898141] audit: type=1400 audit(1691450401.373:55): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-vm2_" pid=10366 comm="apparmor_parser" Aug 7 23:32:42 q1 systemd[3823]: Started snap.lxd.lxc.44a0582a-97eb-4f56-9149-a7b6f2afec5b.scope. 
``` - any relevant kernel output (`syslog`), two GPU case ``` Aug 7 23:45:38 q1 kernel: [ 2450.861745] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none Aug 7 23:45:38 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data Aug 7 23:45:38 q1 kernel: [ 2451.339448] xhci_hcd 0000:17:00.2: remove, state 4 Aug 7 23:45:38 q1 kernel: [ 2451.339464] usb usb4: USB disconnect, device number 1 Aug 7 23:45:38 q1 kernel: [ 2451.340164] xhci_hcd 0000:17:00.2: USB bus 4 deregistered Aug 7 23:45:38 q1 kernel: [ 2451.340188] xhci_hcd 0000:17:00.2: remove, state 4 Aug 7 23:45:38 q1 kernel: [ 2451.340197] usb usb3: USB disconnect, device number 1 Aug 7 23:45:38 q1 kernel: [ 2451.341944] xhci_hcd 0000:17:00.2: USB bus 3 deregistered Aug 7 23:45:40 q1 kernel: [ 2453.384621] vfio-pci 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none Aug 7 23:45:41 q1 kernel: [ 2453.867449] xhci_hcd 0000:31:00.2: remove, state 4 Aug 7 23:45:41 q1 kernel: [ 2453.867464] usb usb6: USB disconnect, device number 1 Aug 7 23:45:41 q1 kernel: [ 2453.868123] xhci_hcd 0000:31:00.2: USB bus 6 deregistered Aug 7 23:45:41 q1 kernel: [ 2453.868144] xhci_hcd 0000:31:00.2: remove, state 4 Aug 7 23:45:41 q1 kernel: [ 2453.868151] usb usb5: USB disconnect, device number 1 Aug 7 23:45:41 q1 kernel: [ 2453.869683] xhci_hcd 0000:31:00.2: USB bus 5 deregistered Aug 7 23:45:41 q1 kernel: [ 2453.966981] audit: type=1400 audit(1691451941.446:56): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-vm2_" pid=11010 comm="apparmor_parser" Aug 7 23:46:00 q1 kernel: [ 2472.883434] vfio-pci 0000:17:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258 Aug 7 23:46:00 q1 kernel: [ 2472.883457] vfio-pci 0000:17:00.0: vfio_ecap_init: hiding ecap 0x19@0x900 Aug 7 23:46:00 q1 kernel: [ 2473.055433] vfio-pci 0000:31:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258 Aug 7 23:46:00 q1 kernel: [ 2473.055455] vfio-pci 0000:31:00.0: vfio_ecap_init: hiding ecap 0x19@0x900 Aug 7 23:45:40 q1 snapd[2334]: message repeated 7 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data] Aug 7 23:47:16 q1 systemd[3823]: Started snap.lxd.lxc.c918eda7-03e8-4d84-9cb2-c9e1b4d6bfa2.scope. 
Aug 7 23:49:01 q1 kernel: [ 2653.889634] vfio-pci 0000:31:00.0: Relaying device request to user (#0) Aug 7 23:49:08 q1 kernel: [ 2660.602855] vfio-pci 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:49:08 q1 kernel: [ 2660.603292] nvidia 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none Aug 7 23:49:08 q1 kernel: [ 2660.690297] snd_hda_intel 0000:31:00.1: Disabling MSI Aug 7 23:49:08 q1 kernel: [ 2660.690325] snd_hda_intel 0000:31:00.1: Handle vga_switcheroo audio client Aug 7 23:49:08 q1 kernel: [ 2660.714786] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input19 Aug 7 23:49:08 q1 kernel: [ 2660.714916] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input20 Aug 7 23:49:08 q1 kernel: [ 2660.715088] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input21 Aug 7 23:49:08 q1 kernel: [ 2660.715283] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input22 Aug 7 23:49:08 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data Aug 7 23:49:08 q1 kernel: [ 2660.726602] xhci_hcd 0000:31:00.2: xHCI Host Controller Aug 7 23:49:08 q1 kernel: [ 2660.726615] xhci_hcd 0000:31:00.2: new USB bus registered, assigned bus number 3 Aug 7 23:49:08 q1 kernel: [ 2660.727221] xhci_hcd 0000:31:00.2: hcc params 0x0180ff05 hci version 0x110 quirks 0x0000000000000010 Aug 7 23:49:08 q1 kernel: [ 2660.727606] xhci_hcd 0000:31:00.2: xHCI Host Controller Aug 7 23:49:08 q1 kernel: [ 2660.727610] xhci_hcd 0000:31:00.2: new USB bus registered, assigned bus number 4 Aug 7 23:49:08 q1 kernel: [ 2660.727613] xhci_hcd 0000:31:00.2: Host supports USB 3.1 Enhanced SuperSpeed Aug 7 23:49:08 q1 kernel: [ 2660.727661] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.02 Aug 7 23:49:08 q1 kernel: [ 2660.727664] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 7 23:49:08 q1 kernel: [ 2660.727666] usb usb3: Product: xHCI Host Controller Aug 7 23:49:08 q1 kernel: [ 2660.727668] usb usb3: Manufacturer: Linux 6.2.0-26-generic xhci-hcd Aug 7 23:49:08 q1 kernel: [ 2660.727669] usb usb3: SerialNumber: 0000:31:00.2 Aug 7 23:49:08 q1 kernel: [ 2660.727830] hub 3-0:1.0: USB hub found Aug 7 23:49:08 q1 kernel: [ 2660.727837] hub 3-0:1.0: 2 ports detected Aug 7 23:49:08 q1 kernel: [ 2660.727975] usb usb4: We don't know the algorithms for LPM for this host, disabling LPM. Aug 7 23:49:08 q1 kernel: [ 2660.727993] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.02 Aug 7 23:49:08 q1 kernel: [ 2660.727995] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 7 23:49:08 q1 kernel: [ 2660.727997] usb usb4: Product: xHCI Host Controller Aug 7 23:49:08 q1 kernel: [ 2660.727999] usb usb4: Manufacturer: Linux 6.2.0-26-generic xhci-hcd Aug 7 23:49:08 q1 kernel: [ 2660.728000] usb usb4: SerialNumber: 0000:31:00.2 Aug 7 23:49:08 q1 kernel: [ 2660.728175] hub 4-0:1.0: USB hub found Aug 7 23:49:08 q1 kernel: [ 2660.728184] hub 4-0:1.0: 4 ports detected Aug 7 23:49:08 q1 snapd[2334]: message repeated 3 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data] Aug 7 23:49:08 q1 systemd[3823]: Reached target Sound Card. 
Aug 7 23:49:08 q1 kernel: [ 2660.807453] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:49:08 q1 kernel: [ 2660.807674] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:49:08 q1 lxd.daemon[3076]: time="2023-08-07T23:49:08Z" level=error msg="Failed to stop device" device=gpu0 err="Failed probing device \"0000:17:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=vm2 instanceType=virtual-machine project=default Aug 7 23:49:08 q1 systemd-networkd[2222]: mac43379c64: Link DOWN Aug 7 23:49:08 q1 systemd-networkd[2222]: mac43379c64: Lost carrier Aug 7 23:49:08 q1 kernel: [ 2661.011901] audit: type=1400 audit(1691452148.495:57): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-vm2_" pid=13584 comm="apparmor_parser" ``` - the VM configuration file ``` architecture: x86_64 config: agent.nic_config: "true" cloud-init.network-config: | version: 1 config: - type: physical name: eth0 subnets: - type: static ipv4: true address: 10.10.10.10/25 gateway: 10.10.10.1 control: auto - type: nameserver address: - 1.1.1.1 - 1.0.0.1 cloud-init.user-data: | #cloud-config ssh_import_id: [gh:itzsimpl] image.architecture: amd64 image.description: ubuntu 22.04 LTS amd64 (release) (20230729) image.label: release image.os: ubuntu image.release: jammy image.serial: "20230729" image.type: disk-kvm.img image.version: "22.04" limits.cpu: "20" limits.memory: 64GiB security.secureboot: "false" volatile.base_image: c3a32ce371819c4fb845867e8e602ad6a636e211cfaeca448e767de4b415c038 volatile.cloud-init.instance-id: f6fa9720-3024-4574-bbd7-e29a10e14ca0 volatile.eth0.hwaddr: 00:16:3e:73:46:f3 volatile.last_state.power: STOPPED volatile.last_state.ready: "false" volatile.uuid: 114bc8ad-0afb-4732-9911-f2583a3330c4 volatile.uuid.generation: 114bc8ad-0afb-4732-9911-f2583a3330c4 volatile.vsock_id: "1262936222" devices: eth0: name: eth0 nictype: macvlan parent: ens97f0np0 type: nic gpu0: gputype: physical pci: "0000:17:00.0" type: gpu gpu1: gputype: physical pci: "0000:31:00.0" type: gpu root: path: / pool: default size: 128GB type: disk ephemeral: false profiles: - default - pub-macvlan - gpu0 - gpu1 stateful: false description: vm2 ``` - LXD log ``` time="2023-08-07T23:05:21Z" level=warning msg=" - Couldn't find the CGroup network priority controller, network priority will be ignored" time="2023-08-07T23:20:01Z" level=error msg="Failed to stop device" device=gpu3 err="Failed probing device \"0000:ca:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=vm2 instanceType=virtual-machine project=default time="2023-08-07T23:49:08Z" level=error msg="Failed to stop device" device=gpu0 err="Failed probing device \"0000:17:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=vm2 instanceType=virtual-machine project=default ``` - lxc monitor two GPU case ``` ... 
location: none metadata: context: device: gpu1 instance: vm2 instanceType: virtual-machine project: default type: gpu level: debug message: Stopping device timestamp: "2023-08-08T09:15:16.503683182Z" type: logging location: none metadata: context: device: gpu0 instance: vm2 instanceType: virtual-machine project: default type: gpu level: debug message: Stopping device timestamp: "2023-08-08T09:15:23.106079819Z" type: logging location: none metadata: context: device: gpu0 err: 'Failed probing device "0000:17:00.1" via "/sys/bus/pci/drivers_probe": write /sys/bus/pci/drivers_probe: invalid argument' instance: vm2 instanceType: virtual-machine project: default level: error message: Failed to stop device timestamp: "2023-08-08T09:15:23.147343221Z" type: logging ... ```
roosterfish commented 1 year ago

Hi @itzsimpl, I am trying to reproduce this by looking through the code.

In the logs you shared I can see that device gpu3 can't be stopped, but this device isn't present in the output of `lxc config show vm2`. At which point was this GPU added/removed?

Additionally, could you share the output of `lxc config show {vm}` while it is running, so we get both GPU config fields `last_state.pci.slot.name` and `last_state.pci.driver`?

For each GPU you add, please also share whether the following path exists: `/sys/bus/pci/devices/{pci}/iommu_group/devices`, e.g. `/sys/bus/pci/devices/0000:17:00.0/iommu_group/devices`.
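
If it is easier, something along these lines should print every function in the group together with the driver it is currently bound to (untested sketch, using gpu0's address as an example):

```
for dev in /sys/bus/pci/devices/0000:17:00.0/iommu_group/devices/*; do
  # Each entry is a PCI function in the same IOMMU group; show its bound driver.
  if [ -e "$dev/driver" ]; then
    driver=$(basename "$(readlink "$dev/driver")")
  else
    driver="(no driver)"
  fi
  echo "$(basename "$dev") -> $driver"
done
```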

itzsimpl commented 1 year ago

@roosterfish I apologise, I was running multiple tests trying to figure out what was going on. I have four GPUs. The gpu3 entry is from a test I ran in a separate VM with the same configuration, but with just gpu3 attached. That GPU never got returned, which is what the log is showing.

Here are the requested outputs:

Click to see full - `lxc config show {vm}` while the vm is running after fresh boot: ``` architecture: x86_64 config: agent.nic_config: "true" cloud-init.network-config: | version: 1 config: - type: physical name: eth0 subnets: - type: static ipv4: true address: 10.10.10.10/25 gateway: 10.10.10.1 control: auto - type: nameserver address: - 1.1.1.1 - 1.0.0.1 cloud-init.user-data: | #cloud-config ssh_import_id: [gh:itzsimpl] image.architecture: amd64 image.description: ubuntu 22.04 LTS amd64 (release) (20230729) image.label: release image.os: ubuntu image.release: jammy image.serial: "20230729" image.type: disk-kvm.img image.version: "22.04" limits.cpu: "20" limits.memory: 64GiB security.secureboot: "false" volatile.base_image: c3a32ce371819c4fb845867e8e602ad6a636e211cfaeca448e767de4b415c038 volatile.cloud-init.instance-id: f6fa9720-3024-4574-bbd7-e29a10e14ca0 volatile.eth0.host_name: mac1c5a5e26 volatile.eth0.hwaddr: 00:16:3e:73:46:f3 volatile.eth0.last_state.created: "false" volatile.gpu0.last_state.pci.driver: nvidia volatile.gpu0.last_state.pci.slot.name: "0000:17:00.0" volatile.gpu1.last_state.pci.driver: nvidia volatile.gpu1.last_state.pci.slot.name: "0000:31:00.0" volatile.last_state.power: RUNNING volatile.last_state.ready: "false" volatile.uuid: 114bc8ad-0afb-4732-9911-f2583a3330c4 volatile.uuid.generation: 114bc8ad-0afb-4732-9911-f2583a3330c4 volatile.vsock_id: "1262936222" devices: eth0: name: eth0 nictype: macvlan parent: ens97f0np0 type: nic gpu0: gputype: physical pci: "0000:17:00.0" type: gpu gpu1: gputype: physical pci: "0000:31:00.0" type: gpu root: path: / pool: default size: 128GB type: disk ephemeral: false profiles: - default - pub-macvlan - gpu0 - gpu1 stateful: false description: vm2 ``` - `lxc config show {vm}` while the vm is running after turning it on and then off: ``` architecture: x86_64 config: agent.nic_config: "true" cloud-init.network-config: | version: 1 config: - type: physical name: eth0 subnets: - type: static ipv4: true address: 10.10.10.10/25 gateway: 10.10.10.1 control: auto - type: nameserver address: - 1.1.1.1 - 1.0.0.1 cloud-init.user-data: | #cloud-config ssh_import_id: [gh:itzsimpl] image.architecture: amd64 image.description: ubuntu 22.04 LTS amd64 (release) (20230729) image.label: release image.os: ubuntu image.release: jammy image.serial: "20230729" image.type: disk-kvm.img image.version: "22.04" limits.cpu: "20" limits.memory: 64GiB security.secureboot: "false" volatile.base_image: c3a32ce371819c4fb845867e8e602ad6a636e211cfaeca448e767de4b415c038 volatile.cloud-init.instance-id: f6fa9720-3024-4574-bbd7-e29a10e14ca0 volatile.eth0.host_name: macd02f8245 volatile.eth0.hwaddr: 00:16:3e:73:46:f3 volatile.eth0.last_state.created: "false" volatile.gpu0.last_state.pci.driver: vfio-pci volatile.gpu0.last_state.pci.slot.name: "0000:17:00.0" volatile.gpu1.last_state.pci.driver: nvidia volatile.gpu1.last_state.pci.slot.name: "0000:31:00.0" volatile.last_state.power: RUNNING volatile.last_state.ready: "false" volatile.uuid: 114bc8ad-0afb-4732-9911-f2583a3330c4 volatile.uuid.generation: 114bc8ad-0afb-4732-9911-f2583a3330c4 volatile.vsock_id: "1262936222" devices: eth0: name: eth0 nictype: macvlan parent: ens97f0np0 type: nic gpu0: gputype: physical pci: "0000:17:00.0" type: gpu gpu1: gputype: physical pci: "0000:31:00.0" type: gpu root: path: / pool: default size: 128GB type: disk ephemeral: false profiles: - default - pub-macvlan - gpu0 - gpu1 stateful: false description: vm2 ``` - contents of `/sys/bus/pci/devices/{pci}/iommu_group/devices` ```bash $ sudo 
ls /sys/bus/pci/devices/0000:17:00.0/iommu_group/devices 0000:17:00.0 0000:17:00.1 0000:17:00.2 0000:17:00.3 $ sudo ls /sys/bus/pci/devices/0000:31:00.0/iommu_group/devices 0000:31:00.0 0000:31:00.1 0000:31:00.2 0000:31:00.3 ```
itzsimpl commented 1 year ago

FWIW, I tested on three other systems with A100, A30 and A5000 (in datacenter mode) GPUs. On all of those systems the GPUs are returned to the host. The main difference I can see between those systems and the one where the GPU (a Quadro RTX) is not returned is that the Axxxx GPUs have only one device in their IOMMU group (the 3D controller at <pci>.0), whereas the Quadro RTX has four (VGA compatible controller <pci>.0, Audio device <pci>.1, USB controller <pci>.2, and Serial bus controller <pci>.3).

itzsimpl commented 1 year ago

I've even tested with vGPU drivers, and the behaviour is often very similar. The VM does shut down, but it does not release all resources. For example, with two vGPUs (one per physical GPU), one is released (including its memory), but the other is not. This can be seen via nvidia-smi, which still shows a portion of the physical GPU's memory as used and the vGPU process as running. Killing the process does not help, as it does not release the physical GPU memory. Turning the VM on and off repeatedly quickly drains the physical GPU of all its memory. The only solution I've found is to reboot the host.
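
The leaked memory is easy to see with a plain query (a sketch; the exact output depends on the driver version):

```
# Memory still reported as used on each physical GPU after the VM has stopped
nvidia-smi --query-gpu=index,pci.bus_id,memory.used --format=csv
```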

tomponline commented 1 year ago

@roosterfish I've assigned this to you until such time as you ascertain that your local setup isn't suitable for investigating this. Let me know. Thanks

tomponline commented 1 year ago

@roosterfish how are you getting on with this? If you're pushed for time maybe @MusicDin might be able to take a look?

roosterfish commented 1 year ago

@tomponline I am still not able to reproduce it. @MusicDin if you want to investigate I would appreciate it.

I suspect the issue is somewhere here:

To address the symptom we could simply not run the revert when stopping the device, but I don't know whether that would have other implications.
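
Done by hand, the step that fails is essentially the re-probe LXD attempts when the device is stopped; in sysfs terms it is roughly the following (a sketch, not the exact code path, and it ignores any driver_override handling):

```
# Unbind the GPU function from vfio-pci and ask the kernel to re-probe it,
# which should bind it back to the nvidia driver.
echo 0000:17:00.0 > /sys/bus/pci/devices/0000:17:00.0/driver/unbind
echo 0000:17:00.0 > /sys/bus/pci/drivers_probe
# The second write is what returns EINVAL ("invalid argument") in the logs above.
```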

tomponline commented 1 year ago

@gabrielmougard, please can you take a look at this, as you have a GPU.

itzsimpl commented 1 year ago

@roosterfish @tomponline FWIW, on our system we were able to return the GPUs to the host by first removing the devices and then rescanning the PCI bus, as in the script below. This works when the entire GPU is passed through, but not with mdev vGPUs. It is of course not a good solution, because it needs to be run on the host every time a VM is stopped, and it does not resolve the vGPU case.


gpu_ids="17 31 b1 ca"   # PCI bus numbers of the GPUs to check

nvidia-smi -pm 0        # ensure persistence mode is off

for gpu in $gpu_ids; do

  # driver currently bound to each function of the GPU (VGA, audio, USB, serial bus)
  vga=$(lspci -nnk | grep -A 3 "$gpu:00.0" | grep 'in use' | sed 's/.*: //g')
  audio=$(lspci -nnk | grep -A 3 "$gpu:00.1" | grep 'in use' | sed 's/.*: //g')
  usb=$(lspci -nnk | grep -A 3 "$gpu:00.2" | grep 'in use' | sed 's/.*: //g')
  serial=$(lspci -nnk | grep -A 3 "$gpu:00.3" | grep 'in use' | sed 's/.*: //g')

  # remove every function that is not bound to its normal host driver
  [ "$vga" != "nvidia" ] && echo 1 >"/sys/bus/pci/devices/0000:$gpu:00.0/remove"
  [ "$audio" != "snd_hda_intel" ] && echo 1 >"/sys/bus/pci/devices/0000:$gpu:00.1/remove"
  [ "$usb" != "xhci_hcd" ] && echo 1 >"/sys/bus/pci/devices/0000:$gpu:00.2/remove"
  [ "$serial" != "nvidia-gpu" ] && echo 1 >"/sys/bus/pci/devices/0000:$gpu:00.3/remove"

  # rescan the bus so the removed functions reappear bound to the host drivers
  if [ "$vga" != "nvidia" ] || [ "$audio" != "snd_hda_intel" ] || [ "$usb" != "xhci_hcd" ] || [ "$serial" != "nvidia-gpu" ]; then
    echo 1 >/sys/bus/pci/rescan
  fi
done