cosminmocan / vfio-single-amdgpu-passthrough

This repo is a tutorial for single AMD GPU passthrough to various QEMU VMs

Vega 8? #3

Closed nift4 closed 2 years ago

nift4 commented 2 years ago

Can this work with a Vega 8 APU? The graphics device looks fine on its own, but on my laptop the GPU audio device shares its IOMMU group with the USB controller, and I need that controller on the host.

cosminmocan commented 2 years ago

Hello @nift4,

Post the IOMMU groups, otherwise we are just speculating. If the devices are in the same group, you might have some luck with the ACS patch, but that can be tricky.

Also, why do you need the USB controller on the host if you are already passing through the GPU?

nift4 commented 2 years ago

My Ethernet card is attached via USB (more precisely, via a USB-C dock) and I don't want to lose the ability to SSH in ;)

IOMMU groups:

IOMMU Group 0:
        00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 1:
        00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0] [1022:15d3]
IOMMU Group 2:
        00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0] [1022:15d3]
IOMMU Group 3:
        00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0] [1022:15d3]
IOMMU Group 4:
        00:01.6 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0] [1022:15d3]
IOMMU Group 5:
        00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
        00:08.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus B [1022:15dc]
        06:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 61)
IOMMU Group 6:
        00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus A [1022:15db]
IOMMU Group 7:
        00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
        00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 8:
        00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 0 [1022:15e8]
        00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 1 [1022:15e9]
        00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 2 [1022:15ea]
        00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 3 [1022:15eb]
        00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 4 [1022:15ec]
        00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 5 [1022:15ed]
        00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 6 [1022:15ee]
        00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 7 [1022:15ef]
IOMMU Group 9:
        01:00.0 Non-Volatile memory controller [0108]: Toshiba Corporation BG3 NVMe SSD Controller [1179:0113] (rev 01)
IOMMU Group 10:
        02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 10)
IOMMU Group 11:
        03:00.0 SD Host controller [0805]: O2 Micro, Inc. SD/MMC Card Reader Controller [1217:8621] (rev 01)
IOMMU Group 12:
        04:00.0 Network controller [0280]: Qualcomm Atheros QCA9377 802.11ac Wireless Network Adapter [168c:0042] (rev 31)
IOMMU Group 13:
        05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] [1002:15dd] (rev c4)
IOMMU Group 14:
        05:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Raven/Raven2/Fenghuang HDMI/DP Audio Controller [1002:15de]
        05:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
        05:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1 [1022:15e0]
        05:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1 [1022:15e1]
        05:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) HD Audio Controller [1022:15e3]
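
A listing like the one above can be regenerated from sysfs. The following is a minimal Python sketch, assuming the usual Linux layout /sys/kernel/iommu_groups/<group>/devices/<pci-address>. The function names iommu_groups and shares_group_with are made up for illustration, and the output contains only PCI addresses, not the lspci descriptions.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted list of PCI addresses} read from sysfs."""
    groups = {}
    base = Path(root)
    if not base.is_dir():  # IOMMU disabled, or not a Linux host
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def shares_group_with(address, groups):
    """Return the other devices in the same IOMMU group as `address`."""
    for members in groups.values():
        if address in members:
            return [m for m in members if m != address]
    return []


if __name__ == "__main__":
    groups = iommu_groups()
    for number in sorted(groups):
        print(f"IOMMU Group {number}:")
        for address in groups[number]:
            print(f"        {address}")
```

On the laptop above, shares_group_with("0000:05:00.1", groups) should return the other members of group 14 (05:00.2, 05:00.3, 05:00.4 and 05:00.6), which is everything that would have to move to the guest together with the GPU audio function.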

nift4 commented 2 years ago

Ah, and the monitor and everything else are also attached through that dock, so it would be unfortunate if I lost the USB controller.

cosminmocan commented 2 years ago

In my opinion, your IOMMU groups will not allow you to do what you want. However, toss a coin and try the ACS patch; maybe you will get lucky.

That Encryption controller will be the biggest issue, in my opinion. Also, if you have your monitor connected via USB-C and you plan to see something after passing through your GPU, you will have to pass through that device.

I might be wrong, as I have never done this on a laptop. I used Intel GVT-g in the past, but that was about it.

Try the ACS patch, see what it does to your IOMMU groups, and report back with the resulting groups.

nift4 commented 2 years ago

That Encryption controller is just my TPM module and I don't think it is used for anything. I think I can also turn it off in my UEFI setup.

Also, if you have your monitor connected via USB-C and you plan to see something after passing through your GPU, you will have to pass through that device.

Hmm, true though. Maybe I can keep ssh with my onboard ethernet controller.

nift4 commented 2 years ago

I just read through what the ACS patch does. Technically I don't need it if I decide to pass whole groups, am I correct?

nift4 commented 2 years ago

My biggest concern is https://github.com/gnif/vendor-reset/issues/8#issuecomment-736479191

cosminmocan commented 2 years ago

Normally you would not be able to pass through an iGPU. Again, I'm speaking with extremely low confidence here, as I have not tried it myself :)

However, just because you would not be able to use vendor-reset does not mean that you would not be able to do passthrough; it would just mean that you always have to do a clean shutdown for your main OS to regain the GPU.

And yes, if you decide to pass through the whole group, you won't need the ACS patch.

Play around with it and see what happens.

As for your network card, connect the laptop via Wi-Fi and SSH through that.

MatthiasLohr commented 2 years ago

@cosminmocan, why do you think it is not possible to pass through an iGPU? I'm currently trying it with a Proxmox host and a Windows guest, and it looks like resetting is the last problem I have to solve to get this working.

cosminmocan commented 2 years ago

I did use an iGPU in the past, with GVT-g on an 8000 series i7 in a laptop. If you are trying to pass through the iGPU the normal way, you will probably fail, as it probably shares its IOMMU group with other vital devices. If not, you are probably good to go.

If resetting is your only issue, try booting Linux with the nomodeset kernel parameter and connecting to it via SSH from another device.

I am curious whether you can get it working.

Share some logs; if I get the time, I will try to help you.

MatthiasLohr commented 2 years ago

Thanks for your fast reply! I'm working on this with a Beelink GTR5 (the product specification is near the end of the product page). I'm not 100% sure it's a Vega 8, but according to the German Wikipedia page it should be a "Radeon Vega" (the English Wikipedia lists something different).

The GPU shares its IOMMU group with one of the SATA slots. However, I'm turning this into a feature and have simply assigned the whole SATA slot to the VM as well, which even removes the need for special VirtIO drivers for Windows to find the drive.

I forwarded both 05:00 and 06:00 with all functions to the VM. When I try to start the VM (kvm), it fails with the following error message:

kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
TASK ERROR: start failed: command '/usr/bin/kvm -id 100 -name vwin -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=e7683797-bc64-422a-8fae-49f3da34b404,serial=AC9S0420270612' -drive 'if=pflash,unit=0,format=raw,readonly=on,file=/usr/share/pve-edk2-firmware//OVMF_CODE_4M.secboot.fd' -drive 'if=pflash,unit=1,format=raw,id=drive-efidisk0,size=540672,file=/dev/pve/vm-100-disk-1' -smp '8,sockets=1,cores=8,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/100.vnc,password=on' -no-hpet -cpu 'kvm64,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep' -m 16384 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=de7cb652-d945-43f6-b294-3ecf343d9ff3' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:05:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on' -device 'vfio-pci,host=0000:05:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1' -device 'vfio-pci,host=0000:05:00.2,id=hostpci0.2,bus=ich9-pcie-port-1,addr=0x0.2' -device 'vfio-pci,host=0000:05:00.3,id=hostpci0.3,bus=ich9-pcie-port-1,addr=0x0.3' -device 'vfio-pci,host=0000:05:00.4,id=hostpci0.4,bus=ich9-pcie-port-1,addr=0x0.4' -device 'vfio-pci,host=0000:05:00.5,id=hostpci0.5,bus=ich9-pcie-port-1,addr=0x0.5' -device 'vfio-pci,host=0000:05:00.6,id=hostpci0.6,bus=ich9-pcie-port-1,addr=0x0.6' -device 'vfio-pci,host=0000:06:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,multifunction=on' -device 'vfio-pci,host=0000:06:00.1,id=hostpci1.1,bus=ich9-pcie-port-2,addr=0x0.1' -chardev 
'socket,id=tpmchar,path=/var/run/qemu-server/100.swtpm' -tpmdev 'emulator,id=tpmdev,chardev=tpmchar' -device 'tpm-tis,tpmdev=tpmdev' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -chardev 'socket,path=/var/run/qemu-server/100.qga,server=on,wait=off,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:fb5259cc630' -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' -drive 'file=/var/lib/vz/template/iso/Win10_21H2_EnglishInternational_x64.iso,if=none,id=drive-sata0,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' -drive 'file=/var/lib/vz/template/iso/virtio-win.iso,if=none,id=drive-sata1,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ahci0.1,drive=drive-sata1,id=sata1,bootindex=101' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=7A:0B:39:8A:B7:19,netdev=net0,bus=pci.0,addr=0x12,id=net0' -rtc 'driftfix=slew,base=localtime' -machine 'type=pc-q35-6.1+pve0' -global 'kvm-pit.lost_tick_policy=discard'' failed: got timeout

Searching for the vfio: Unable to power on device, stuck in D3 error message, I stumbled across this issue.

Adding nomodeset (with updating grub and rebooting) unfortunately didn't change anything. Current boot parameters:

quiet iommu=pt amd_iommu=on video=efifb:off nomodeset

Current vfio module settings:

options vfio-pci ids=1002:1638,1002:1637,1022:15df,1022:1639,1022:15e2,1022:15e3,1022:7901 disable_vga=1 disable_idle_d3=1
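
One quick sanity check on a line like this is whether every device meant for the guest is actually listed in ids=. Below is a minimal Python sketch under stated assumptions: parse_vfio_ids and missing_ids are hypothetical helper names, only the plain vendor:device form is handled, and the wanted IDs would be copied by hand from the bracketed values in `lspci -nn` output.

```python
import re

# Plain vendor:device form only, e.g. 1002:1638 (assumption: no subvendor/class fields).
ID_PATTERN = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{4}$")


def parse_vfio_ids(options_line):
    """Extract the vendor:device IDs from an `options vfio-pci ids=...` line."""
    for token in options_line.split():
        if token.startswith("ids="):
            ids = [i.lower() for i in token[len("ids="):].split(",") if i]
            bad = [i for i in ids if not ID_PATTERN.match(i)]
            if bad:
                raise ValueError(f"malformed IDs: {bad}")
            return ids
    return []


def missing_ids(options_line, wanted):
    """Return the IDs in `wanted` that the options line does not list."""
    claimed = set(parse_vfio_ids(options_line))
    return [i for i in wanted if i.lower() not in claimed]


if __name__ == "__main__":
    line = ("options vfio-pci ids=1002:1638,1002:1637,1022:15df,1022:1639,"
            "1022:15e2,1022:15e3,1022:7901 disable_vga=1 disable_idle_d3=1")
    # IDs to check, copied by hand from `lspci -nn` on the host in question.
    print(missing_ids(line, ["1002:1638", "1022:7901"]))
```

An empty list means every wanted ID is listed; it says nothing about whether vfio-pci actually bound the devices at boot, which `lspci -k` would show.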

If you have any idea, please let me know - I appreciate any help! Thank you very much for the offer!

MatthiasLohr commented 2 years ago

Anything new on this issue?

Pandaaaa906 commented 2 years ago

@MatthiasLohr I have a similar NUC with a 5900HX. I used the ACS patch to separate the IOMMU groups, but I got error code 43 in the Windows VM.

cosminmocan commented 2 years ago

@MatthiasLohr I would like to help you, but for that I need some structured info from you. Have you tried the ACS patch? It should change the IOMMU groups a little.

@Pandaaaa906 I had that issue before. Maybe try a VM that uses Q35 as the chipset and OVMF (UEFI) as the firmware.

MatthiasLohr commented 2 years ago

@MatthiasLohr I would like to help you, but for that I need some structured info from you. Have you tried the ACS patch? It should change the IOMMU groups a little.

That would be great, thanks in advance!

No, I haven't tried it yet, since I'm not sure what exactly it does or how I should apply it. I found several options, but I'm unsure which way to go.

What I have done so far: https://gist.github.com/MatthiasLohr/8111bbdea62061f4cc3e01fe0bc8d071

cosminmocan commented 2 years ago

@MatthiasLohr From what I saw, nothing looks out of place, so my next attempt would be the following: try to export the GPU firmware/ROM (GPU-Z to the rescue) and use that when passing the GPU through. I found this video in Chinese for you: https://www.youtube.com/watch?v=qOLss24FAP8

Also, after doing so and starting the VM, please open an SSH session to your Proxmox host, get the dmesg output, and post it here.

MatthiasLohr commented 2 years ago

@cosminmocan I'm not able to extract the firmware/ROM. When I try with GPU-Z, I get "BIOS reading not supported on this device". I also tried https://github.com/SpaceinvaderOne/Dump_GPU_vBIOS on Linux; that script tries to access a file called /sys/bus/pci/devices/<id>/rom, which does not exist on my system (in fact, there is no file called rom anywhere under /sys/bus/pci on my system).
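
The "no rom file anywhere" observation can be double-checked with a few lines of Python. This is a minimal sketch, assuming the standard /sys/bus/pci/devices layout; devices_with_rom is a made-up name. The kernel creates the rom attribute only for devices that advertise an expansion ROM, and an iGPU whose VBIOS is part of the system firmware may not do so, which would match the report above.

```python
from pathlib import Path


def devices_with_rom(root="/sys/bus/pci/devices"):
    """Return the PCI addresses under `root` that expose a `rom` attribute."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(d.name for d in base.iterdir() if (d / "rom").exists())


if __name__ == "__main__":
    found = devices_with_rom()
    if found:
        print("Devices exposing a rom attribute:", ", ".join(found))
    else:
        print("No PCI device exposes a rom attribute")
```

The sketch only checks that the attribute exists; it does not try to read the ROM contents.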

Some more information on my graphics card: https://www.techpowerup.com/gpu-specs/radeon-vega-8-mobile.c3771

When I'm trying to boot the VM with passed-through graphics card (obviously without the rom file), I'm getting the following output on dmesg:

[  337.925843] xhci_hcd 0000:05:00.3: remove, state 4
[  337.925870] usb usb2: USB disconnect, device number 1
[  337.926117] xhci_hcd 0000:05:00.3: USB bus 2 deregistered
[  337.926138] xhci_hcd 0000:05:00.3: remove, state 1
[  337.926150] usb usb1: USB disconnect, device number 1
[  337.926162] usb 1-4: USB disconnect, device number 2
[  337.986636] xhci_hcd 0000:05:00.3: USB bus 1 deregistered
[  338.097705] xhci_hcd 0000:05:00.4: remove, state 4
[  338.097735] usb usb4: USB disconnect, device number 1
[  338.098011] xhci_hcd 0000:05:00.4: USB bus 4 deregistered
[  338.098036] xhci_hcd 0000:05:00.4: remove, state 1
[  338.098049] usb usb3: USB disconnect, device number 1
[  338.098063] usb 3-3: USB disconnect, device number 2
[  338.098958] usb 3-4: USB disconnect, device number 3
[  338.115541] xhci_hcd 0000:05:00.4: USB bus 3 deregistered
[  338.752426] device tap100i0 entered promiscuous mode
[  338.756878] vmbr0: port 2(tap100i0) entered blocking state
[  338.756894] vmbr0: port 2(tap100i0) entered disabled state
[  338.756967] vmbr0: port 2(tap100i0) entered blocking state
[  338.756980] vmbr0: port 2(tap100i0) entered forwarding state
[  340.192681] vfio-pci 0000:05:00.0: enabling device (0002 -> 0003)
[  340.192946] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  340.192963] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  340.192974] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  340.192985] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  340.192997] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  340.193583] vfio-pci 0000:05:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[  340.194218] vfio-pci 0000:05:00.1: enabling device (0000 -> 0002)
[  340.253121] vfio-pci 0000:05:00.2: enabling device (0000 -> 0002)
[  340.254965] vfio-pci 0000:05:00.3: enabling device (0000 -> 0002)
[  340.310791] vfio-pci 0000:05:00.4: enabling device (0000 -> 0002)
[  340.385059] vfio-pci 0000:05:00.5: enabling device (0000 -> 0002)
[  340.433059] vfio-pci 0000:05:00.6: enabling device (0000 -> 0002)
[  342.068944] vfio-pci 0000:05:00.6: vfio_bar_restore: reset recovery - restoring BARs
[  342.084936] vfio-pci 0000:05:00.5: vfio_bar_restore: reset recovery - restoring BARs
[  342.100936] vfio-pci 0000:05:00.4: vfio_bar_restore: reset recovery - restoring BARs
[  342.132936] vfio-pci 0000:05:00.3: vfio_bar_restore: reset recovery - restoring BARs
[... last lines repeated ~40 times ...]

Any idea? Anything else I can/should try?

MatthiasLohr commented 2 years ago

I'm not sure I'm interpreting these logs correctly, but could it be that, apart from the ACS patch question, too many things (e.g., USB) get disconnected when I start the VM with the GPU passed through?

MatthiasLohr commented 2 years ago

OK, I removed the "all functions" setting; I'm now forwarding only the main graphics device(?), 0000:05:00.0. Errors in dmesg:

[   43.250271] vmbr0: port 2(tap100i0) entered blocking state
[   43.250286] vmbr0: port 2(tap100i0) entered disabled state
[   43.250354] vmbr0: port 2(tap100i0) entered blocking state
[   43.250364] vmbr0: port 2(tap100i0) entered forwarding state
[   43.269727] kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable
[   44.701241] vfio-pci 0000:05:00.0: enabling device (0002 -> 0003)
[   44.701452] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[   44.701469] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[   44.701480] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[   44.701492] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[   44.701503] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[   44.702115] vfio-pci 0000:05:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]

WSt89 commented 1 year ago

Hello, I am stuck in the same situation you have described. Does anybody have an update on this?

MatthiasLohr commented 1 year ago

Unfortunately, no, but still happy to learn about any progress on this.

WSt89 commented 1 year ago

@Naunter has reported a successful passthrough of a 5700G iGPU (which is also a Vega 8) in this thread: https://gist.github.com/matt22207/bb1ba1811a08a715e32f106450b0418a. However, the GPU reset bug still remains.

I have not tried to follow the instructions yet, as I would first need to enable CSM in the BIOS of the host machine.

perfectnewer commented 8 months ago

[ 44.702115] vfio-pci 0000:05:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]

Try adding initcall_blacklist=sysfb_init to the kernel command line, or unbind the framebuffer and vtconsole before starting the VM.
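
To see what is holding the BAR 0 range named in that dmesg line, one option is to look for overlapping entries in /proc/iomem. The following is a minimal Python sketch, not a definitive tool: claimants is a hypothetical helper, and the range 0xd0000000-0xdfffffff is copied from the error message. A boot framebuffer typically shows up under a name such as BOOTFB, efifb or simplefb, though the exact name depends on the kernel and is an assumption here.

```python
import re

# One /proc/iomem entry: optional indentation, "start-end : name" (hex, no 0x prefix).
LINE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def claimants(iomem_text, start, end):
    """Return the names of /proc/iomem entries overlapping [start, end]."""
    names = []
    for line in iomem_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        lo, hi = int(match.group(1), 16), int(match.group(2), 16)
        if lo <= end and hi >= start:  # the two ranges intersect
            names.append(match.group(3).strip())
    return names


if __name__ == "__main__":
    # Run as root: unprivileged reads of /proc/iomem show every address as zero.
    with open("/proc/iomem") as handle:
        for name in claimants(handle.read(), 0xD0000000, 0xDFFFFFFF):
            print(name)
```

The result includes the enclosing PCI bus windows as well as the device itself; the interesting entry is any framebuffer-like name nested inside the device's range, since that is the kind of claim the two suggestions above are meant to remove.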