Document more explicitly that you have to add a custom PCI ID for GPUs that are supported but are not in the default list

cdknight commented 3 years ago

Hi everyone,

Thanks very much for this work. I've been wanting to try out vGPUs for a very, very long time, and this might make my dreams come true, so it's very exciting.

I attempted to follow the instructions and nvidia-vgpud said I had an unsupported vGPU (I have a GTX 1060 6GB, which should be supported, right?).

I added this to the vgpu_unlock script, which made nvidia-vgpud "work" (as in, it exits with an exit code of zero):

                // GP104
                if(actual_devid == 0x1b80 || // GTX 1080
                   actual_devid == 0x1b81 || // GTX 1070
                   actual_devid == 0x1b82 || // GTX 1070 Ti
                   actual_devid == 0x1c03 || // GTX 1060 6GB, **mine**
                   actual_devid == 0x1b83 || // GTX 1060 6GB
                   actual_devid == 0x1b84 || // GTX 1060 3GB
                   actual_devid == 0x1bb0) { // Quadro P5000
                    spoofed_devid = 0x1bb3; // Tesla P4
                }
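For readers who find a lookup table easier to follow, here is a minimal Python sketch of the same idea as the check above. The function name `spoofed_device_id` and the table layout are illustrative assumptions, not code from vgpu_unlock; the IDs are only the ones listed in the snippet above.

```python
# Hypothetical lookup table, built only from the IDs in the snippet above.
TESLA_P4 = 0x1BB3

SPOOF_AS_TESLA_P4 = {
    0x1B80,  # GTX 1080
    0x1B81,  # GTX 1070
    0x1B82,  # GTX 1070 Ti
    0x1C03,  # GTX 1060 6GB (the ID added in this comment)
    0x1B83,  # GTX 1060 6GB
    0x1B84,  # GTX 1060 3GB
    0x1BB0,  # Quadro P5000
}


def spoofed_device_id(actual_devid: int) -> int:
    """Return the device ID to report; unlisted IDs pass through unchanged."""
    if actual_devid in SPOOF_AS_TESLA_P4:
        return TESLA_P4
    return actual_devid


print(hex(spoofed_device_id(0x1C03)))
print(hex(spoofed_device_id(0x1234)))
```

An ID that is not in the set comes back unchanged, which is presumably why an unlisted card gets reported to the driver as itself.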

Here are the systemd logs for what I mean by nvidia-vgpud exiting:

Apr 09 22:37:29 localhost nvidia-vgpud[5660]: Number of Displays: 1
Apr 09 22:37:29 localhost nvidia-vgpud[5660]: Max pixels: 8847360
Apr 09 22:37:29 localhost nvidia-vgpud[5660]: Display: width 4096, height 2160
Apr 09 22:37:29 localhost nvidia-vgpud[5660]: License: NVIDIA-vComputeServer,9.0;Quadro-Virtual-DWS,5.0
Apr 09 22:37:29 localhost nvidia-vgpud[5660]: PID file unlocked.
Apr 09 22:37:29 localhost nvidia-vgpud[5660]: PID file closed.
Apr 09 22:37:29 localhost nvidia-vgpud[5660]: Shutdown (5660)

I'm not certain this is what's supposed to happen (shouldn't it keep running?)

I went and created an mdev, following the instructions here.

When I added the mdev to libvirt, I used the following XML

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci'>
  <source>
    <address uuid='84c0de53-9363-478d-876c-b298956a4af1'/>
  </source>
</hostdev>

I get the following error when starting the VM, though:

qemu-system-x86_64: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/84c0de53-9363-478d-876c-b298956a4af1,display=off,bus=pci.5,addr=0x0: vfio 84c0de53-9363-478d-876c-b298956a4af1: error getting device from group 8: Input/output error

Verify all devices in group 8 are bound to vfio-<bus> or pci-stub and not already in use

Dmesg says:

[nvidia-vgpu-vfio] 84c0de53-9363-478d-876c-b298956a4af1: start failed. status: 0x1

Did I do something wrong? Should I be using CentOS/RHEL instead of openSUSE?

I then found out that the systemd service nvidia-vgpu-mgr is a thing. These were its logs:

Apr 09 22:37:29 localhost notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Apr 09 22:37:29 localhost nvidia-vgpu-mgr[5614]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 84c0de53-9363-478d-876c-b29>
Apr 09 22:37:29 localhost nvidia-vgpu-mgr[5614]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=63
Apr 09 22:37:29 localhost nvidia-vgpu-mgr[5614]: notice: vmiop_env_log: Successfully updated env symbols!
Apr 09 22:37:29 localhost nvidia-vgpu-mgr[5614]: error: vmiop_log: (0x0): vGPU is supported only on VGX capable boards
Apr 09 22:37:29 localhost nvidia-vgpu-mgr[5614]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (vGPU validation of the GPU failed)
Apr 09 22:37:29 localhost nvidia-vgpu-mgr[5614]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
Apr 09 22:37:29 localhost nvidia-vgpu-mgr[5614]: error: vmiop_log: display_init failed for inst: 0
Apr 09 22:37:29 localhost nvidia-vgpu-mgr[5614]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error

I set ExecStart to /opt/vgpu_unlock/vgpu_unlock /usr/bin/nvidia-vgpu-mgr (in hopes that would help), and now I have:

 nvidia-vgpu-mgr[2314]: notice: vmiop_env_log: Successfully updated env symbols!
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: NVOS status 0x56
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: Assertion Failed at 0x8429e183:293
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: 10 frames returned by backtrace
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: /usr/lib64/libnvidia-vgpu.so(_nv004938vgpu+0x26) [0x7f49842ee6a6]
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: /usr/lib64/libnvidia-vgpu.so(+0x88a7a) [0x7f498429ca7a]
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: /usr/lib64/libnvidia-vgpu.so(+0x8a183) [0x7f498429e183]
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: vgpu() [0x4119f1]
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: vgpu() [0x412955]
Apr 09 22:49:09 localhost nvidia-vgpu-mgr[2314]: error: vmiop_log: vgpu() [0x40d1fc]

Is there something I'm missing, or is my setup just wrong/not supported, or did I mess up something, or… is this a bug that my GPU doesn't work?

cdknight commented 3 years ago

Update: I had to also update the hooks and it seems like things are working :D I might submit a PR for my PCI ID if this works out, perhaps.

Saschanski commented 3 years ago

What is nvidia-smi reporting for you? For me it's still showing the 1060, but I'm not sure if that's intended.

I've also added the IDs now, but on Proxmox I get the following:

[nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured

Is it UUID related?

cdknight commented 3 years ago

Add this to your VM config file (in /etc/pve/qemu-server/):

args: -uuid 00000000-0000-0000-0000-000000000100

It should work after that.

Edit: to further answer your question, yes, it is UUID related. The vGPU manager requires that you have the UUID as a QEMU argument, or it won't let the VM start.
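A small Python sketch for producing that config line, assuming you want to validate the UUID first. `proxmox_uuid_args` is a hypothetical helper name; the only thing taken from this thread is the `args: -uuid <uuid>` format.

```python
import uuid


def proxmox_uuid_args(vm_uuid: str) -> str:
    """Validate a UUID string and format the Proxmox `args:` line for it."""
    # uuid.UUID raises ValueError on malformed input; str() gives the
    # canonical lowercase, hyphenated form.
    canonical = str(uuid.UUID(vm_uuid.strip()))
    return f"args: -uuid {canonical}"


print(proxmox_uuid_args("00000000-0000-0000-0000-000000000100"))
```

Stripping whitespace before validating guards against a stray space sneaking into the UUID when it is copied between config files.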

Saschanski commented 3 years ago

All right, it looks like I'm now receiving the same error you had:

nvidia-vgpu-mgr[10884]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
nvidia-vgpu-mgr[10884]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000100 GPU PCI id 00:01:00.0 config params vgpu
nvidia-vgpu-mgr[10884]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=63
nvidia-vgpu-mgr[10884]: notice: vmiop_env_log: Successfully updated env symbols!
nvidia-vgpu-mgr[10884]: error: vmiop_log: NVOS status 0x56
nvidia-vgpu-mgr[10884]: error: vmiop_log: Assertion Failed at 0xd3940183:293
nvidia-vgpu-mgr[10884]: error: vmiop_log: 10 frames returned by backtrace
nvidia-vgpu-mgr[10884]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv004938vgpu+0x26) [0x7fb3d39906a6]
nvidia-vgpu-mgr[10884]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0x88a7a) [0x7fb3d393ea7a]
nvidia-vgpu-mgr[10884]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0x8a183) [0x7fb3d3940183]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x4119f1]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x412955]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x40d1fc]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x40ae74]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x4035da]
nvidia-vgpu-mgr[10884]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fb3d3e0b09b]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x403621]
nvidia-vgpu-mgr[10884]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (error setting vGPU configuration information from RM)
nvidia-vgpu-mgr[10884]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
nvidia-vgpu-mgr[10884]: error: vmiop_log: display_init failed for inst: 0
nvidia-vgpu-mgr[10884]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
nvidia-vgpu-mgr[10884]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x1f

In vgpu_unlock at line 109 I added:

actual_devid == 0x1c03 || // GTX 1060 6GB

In vgpu_unlock_hooks.c at line 719 I added:

case 0x1c03: /* GTX 1060 6GB */

followed by

dkms remove -m nvidia -v 460.32.04 --all
dkms install -m nvidia -v 460.32.04

Did I forget something?

cdknight commented 3 years ago

Your GPU might have a different device ID than mine.

What you can do is open the vgpu_unlock script and, before the actual_devid == 0x1c03 part and before the if statements, add this line:

console.log("Actual devid is " + actual_devid")

// GP102
if (

Then log in as root, and run /opt/vgpu_unlock/vgpu_unlock /usr/bin/nvidia-vgpud. It won't do anything, but it will print out the actual_devid. Then you can convert the output to hexadecimal and you will find your PCI ID. Replace the 1c03 with what you find.

Edit: You don't need to convert to hexadecimal, but it looks more streamlined if you do.

There might be an easier way to find the PCI ID, but this works for me.

Saschanski commented 3 years ago

/opt/vgpu_unlock-master/vgpu_unlock /usr/bin/nvidia-vgpud

Errors out:

Traceback (most recent call last):
  File "/opt/vgpu_unlock-master/vgpu_unlock", line 222, in <module>
    main()
  File "/opt/vgpu_unlock-master/vgpu_unlock", line 212, in main
    instrument(pid)
  File "/opt/vgpu_unlock-master/vgpu_unlock", line 170, in instrument
    script = session.create_script(script_source)
  File "/usr/local/lib/python3.7/dist-packages/frida/core.py", line 26, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/frida/core.py", line 204, in create_script
    return Script(self._impl.create_script(*args, **kwargs))
frida.InvalidArgumentError: script(line 79): SyntaxError: unexpected end of string

#77 if(status == STATUS_TRY_AGAIN) {
#78     // Driver will try again.
#79     return;
#80 }

Which makes no sense. I'm not a Python guy, so I don't know. Any logging I add causes an error.

Anyway this is what i get from lspci:

lspci -s 01:00

01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)

lspci -n -s 01:00

01:00.0 0300: 10de:1c03 (rev a1)
01:00.1 0403: 10de:10f1 (rev a1)

Looks the same 🤔
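The `lspci -n` output above already shows the vendor and device IDs in hex. Here is a minimal Python sketch for pulling them out of one such line; `parse_lspci_n` is a hypothetical helper, and the regex assumes the default output layout shown above.

```python
import re
from typing import Tuple

# Matches lines such as "01:00.0 0300: 10de:1c03 (rev a1)".
LSPCI_N_LINE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<cls>[0-9a-fA-F]{4}):\s+"
    r"(?P<vendor>[0-9a-fA-F]{4}):(?P<device>[0-9a-fA-F]{4})"
)


def parse_lspci_n(line: str) -> Tuple[int, int]:
    """Return (vendor_id, device_id) parsed from one `lspci -n` line."""
    match = LSPCI_N_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an lspci -n line: {line!r}")
    return int(match["vendor"], 16), int(match["device"], 16)


vendor, device = parse_lspci_n("01:00.0 0300: 10de:1c03 (rev a1)")
print(hex(vendor), hex(device))
```

The second field of the pair is the value to compare against the device IDs in the vgpu_unlock script.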

cdknight commented 3 years ago

Hmm, did you add the entire thing?

console.log("Actual devid is " + actual_devid")

// GP102

I meant that you should only add the first thing, but before the GP102 comment. That would be my only explanation for the syntax error. It still doesn't make sense that yours isn't working though, since my lspci -n -s shows the same thing...

Saschanski commented 3 years ago

Well, since it is a script, the stray " escaped the whole thing.

Here's the output:

actual_devid =  7171
spoofed_devid =  7171
actual_subsysid =  34230
spoofed_subsysid =  34230

cdknight commented 3 years ago

From the logs, it sounds like your DKMS module is the issue. If the Python script were wrong, it would tell you that you have an unsupported card, but the error you're getting comes from the unlock hooks.

For me it's working perfectly at this point, so I'm not too sure where you went wrong, but it always helps to just start over from scratch (that's what I did, went from openSUSE → Proxmox). Also make sure you're using the vgpu-kvm driver (not the grid one) since I know that's a mistake I made.

Also, might be unrelated, but did you enable IOMMU? You have to do that for it to work IIUC.
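To summarize the error signatures seen so far, here is a rough Python sketch that maps log substrings to the fixes suggested in this thread. The `HINTS` table and `hints_for` are hypothetical, and the hints are only heuristics collected from the comments above, not an official diagnostic.

```python
# Hints collected from this thread only; not an official NVIDIA or
# vgpu_unlock diagnostic table.
HINTS = [
    ("vGPU is supported only on VGX capable boards",
     "nvidia-vgpu-mgr may not be running under vgpu_unlock"),
    ("NVOS status 0x56",
     "the unlock hooks (DKMS module) may not include the device ID"),
    ("status: 0x65",
     "the VM may be missing the -uuid QEMU argument"),
]


def hints_for(log_text: str) -> list:
    """Return the hints whose signature string appears in log_text."""
    return [hint for signature, hint in HINTS if signature in log_text]


sample = "error: vmiop_log: NVOS status 0x56"
for hint in hints_for(sample):
    print(hint)
```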

Saschanski commented 3 years ago

Yes, IOMMU is enabled. Currently I'm using PCI passthrough (of course, I disabled that stuff before I attempted the vGPU driver).

I'm not sure about the driver, to be honest, since my registration at NVIDIA is not getting through. I'm using wild stuff I found on Google. Maybe someone can share the package?

cdknight commented 3 years ago

Your driver is likely outdated or something. What I found on Google wasn't working either (and it was for XenServer, not a generic installer). For me the registration at NVIDIA here took about 2 minutes. I would recommend trying on a different email address (I used Protonmail and that worked just fine).

Saschanski commented 3 years ago

Well, I used Protonmail as well. I'll give it another shot.

DualCoder commented 3 years ago

This is actually not an error. The GTX 1060 6GB that has PCI device ID 1C03 contains the GP106 chip. The GP106 does not appear on any GPU supported by vGPU. I have therefore assumed that it is not possible to use those 1060s, and that PCI device ID does not appear in the code.

If any of you have been able to get this working by spoofing it as a Tesla P4 (GP104), then my assumption was wrong and vGPU might be a bit more flexible than I thought.

cdknight commented 3 years ago

Interesting. Yes, it definitely works with vGPU, as I am running two VMs right now and am passing my GPU to both of them. I wonder if, in that case, support for other non-supported GPUs might be possible (e.g., as I referenced in another thread, the GTX 780 spoofed as something like the GRID K2)?

I may test this later, since I have two GTX 780s.

KrutavShah commented 3 years ago

@DualCoder GP106 confirmed working as P4, we are adding loads of PCI IDs including that for GP106 cards to vgpu_unlock.

darabontors commented 3 years ago

> What is nvidia-smi reporting for you? For me it's still showing the 1060, but I'm not sure if that's intended.
>
> I've also added the IDs now, but on Proxmox I get the following:
>
> [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured
>
> Is it UUID related?

Hi,

I have a very similar error with a 1080 Ti after adding args: -uuid 00000000-0000-0000-0000-000000000100. The VM start errors out:

[    2.813886] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[    2.814266] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    4.950393] audit: type=1400 audit(1618303969.884:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=608 comm="apparmor_parser"
[    4.950395] audit: type=1400 audit(1618303969.884:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=608 comm="apparmor_parser"
[    5.639081] nvidia 0000:01:00.0: MDEV: Registered
[   45.753371] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x1

If I don't include the args: -uuid 00000000-0000-0000-0000-000000000100, my error is:

[    2.752442] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[    2.752829] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    4.886399] audit: type=1400 audit(1618303279.819:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=589 comm="apparmor_parser"
[    4.886401] audit: type=1400 audit(1618303279.819:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=589 comm="apparmor_parser"
[    5.580480] nvidia 0000:01:00.0: MDEV: Registered
[  169.065653] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured

cdknight commented 3 years ago

These IDs were added a while ago, so I'm closing this. However, I might also add that GP106 works with P40 as well, which may provide some additional profiles.

@darabontors you may get some mileage trying Environment="__RM_NO_VERSION_CHECK=1" before the ExecStart in both of the systemd files for the vgpu-mgr and vgpud. For more support, join the Discord server.

DualCoder / vgpu_unlock

Document more explicitly that you have to add a custom PCI ID for GPUs that are supported but are not in the default list #9