Closed cdknight closed 3 years ago
Update: I had to also update the hooks and it seems like things are working :D I might submit a PR for my PCI ID if this works out, perhaps.
What is nvidia-smi reporting for you? For me it's still showing the 1060, but I'm not sure if that's intended.
I've also added the IDs now, but on Proxmox I get the following:
[nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured
Is it UUID related?
Add this to your VM config file (in /etc/pve/qemu-server/):
args: -uuid 00000000-0000-0000-0000-000000000100
It should work after that.
Edit: to further answer your question, yes, it is UUID related. The vGPU manager requires that you have the UUID as a QEMU argument, or it won't let the VM start.
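For anyone who wants to double-check that step, here is a minimal Python sketch that looks for an `args:` line carrying a `-uuid` that matches the mdev UUID. The helper name `has_uuid_arg` and the sample config text are made up for illustration; only the `args: -uuid ...` line itself comes from this thread.

```python
import re

def has_uuid_arg(conf_text: str, mdev_uuid: str) -> bool:
    """Return True if an 'args:' line passes '-uuid <mdev_uuid>' to QEMU."""
    for line in conf_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "args":
            match = re.search(r"-uuid\s+(\S+)", value)
            if match and match.group(1).lower() == mdev_uuid.lower():
                return True
    return False

# Hypothetical config contents; only the args line is taken from the thread.
sample_conf = """memory: 8192
args: -uuid 00000000-0000-0000-0000-000000000100
"""
print(has_uuid_arg(sample_conf, "00000000-0000-0000-0000-000000000100"))
```

This only checks the text of the config; it says nothing about whether the vGPU manager will accept the VM.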
All right, looks like I'm now receiving the same error you had:
nvidia-vgpu-mgr[10884]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
nvidia-vgpu-mgr[10884]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000100 GPU PCI id 00:01:00.0 config params vgpu
nvidia-vgpu-mgr[10884]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=63
nvidia-vgpu-mgr[10884]: notice: vmiop_env_log: Successfully updated env symbols!
nvidia-vgpu-mgr[10884]: error: vmiop_log: NVOS status 0x56
nvidia-vgpu-mgr[10884]: error: vmiop_log: Assertion Failed at 0xd3940183:293
nvidia-vgpu-mgr[10884]: error: vmiop_log: 10 frames returned by backtrace
nvidia-vgpu-mgr[10884]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv004938vgpu+0x26) [0x7fb3d39906a6]
nvidia-vgpu-mgr[10884]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0x88a7a) [0x7fb3d393ea7a]
nvidia-vgpu-mgr[10884]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0x8a183) [0x7fb3d3940183]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x4119f1]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x412955]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x40d1fc]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x40ae74]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x4035da]
nvidia-vgpu-mgr[10884]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fb3d3e0b09b]
nvidia-vgpu-mgr[10884]: error: vmiop_log: vgpu() [0x403621]
nvidia-vgpu-mgr[10884]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (error setting vGPU configuration information from RM)
nvidia-vgpu-mgr[10884]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
nvidia-vgpu-mgr[10884]: error: vmiop_log: display_init failed for inst: 0
nvidia-vgpu-mgr[10884]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
nvidia-vgpu-mgr[10884]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x1f
In vgpu_unlock, at line 109, I added:
actual_devid == 0x1c03 || // GTX 1060 6GB
In vgpu_unlock_hooks.c, at line 719, I added:
case 0x1c03: /* GTX 1060 6GB */
followed by
dkms remove -m nvidia -v 460.32.04 --all
dkms install -m nvidia -v 460.32.04
Did I forget something?
Your GPU might have a different device ID than mine.
What you can do is go into the vgpu_unlock script. Before the actual_devid == 0x1c03 part, before the if statements, add this line:
console.log("Actual devid is " + actual_devid")
// GP102
if (
Then log in as root and run /opt/vgpu_unlock/vgpu_unlock /usr/bin/nvidia-vgpud. It won't do anything, but it will print out the actual_devid. Then you can convert the output to hexadecimal and you will find your PCI ID. Replace the 1c03 with what you find.
Edit: You don't need to convert to hexadecimal, but it looks more streamlined if you do.
There might be an easier way to find the PCI ID, but this works for me.
/opt/vgpu_unlock-master/vgpu_unlock /usr/bin/nvidia-vgpud
Errors out:
Traceback (most recent call last):
File "/opt/vgpu_unlock-master/vgpu_unlock", line 222, in <module>
main()
File "/opt/vgpu_unlock-master/vgpu_unlock", line 212, in main
instrument(pid)
File "/opt/vgpu_unlock-master/vgpu_unlock", line 170, in instrument
script = session.create_script(script_source)
File "/usr/local/lib/python3.7/dist-packages/frida/core.py", line 26, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/frida/core.py", line 204, in create_script
return Script(self._impl.create_script(*args, **kwargs))
frida.InvalidArgumentError: script(line 79): SyntaxError: unexpected end of string
#77 if(status == STATUS_TRY_AGAIN) {
#78 // Driver will try again.
#79 return;
#80 }
Which makes no sense. I'm not a Python guy, so I don't know. Any logging I add causes an error.
Anyway, this is what I get from lspci:
lspci -s 01:00
01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
lspci -n -s 01:00
01:00.0 0300: 10de:1c03 (rev a1)
01:00.1 0403: 10de:10f1 (rev a1)
Looks the same 🤔
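To compare IDs without reading them off by eye, the `lspci -n` lines above can be parsed with a short Python sketch. The function name `parse_lspci_n` and the regular expression are illustrative assumptions based only on the two output lines shown here, not on a full description of the lspci format.

```python
import re

# Matches lines such as "01:00.0 0300: 10de:1c03 (rev a1)".
LSPCI_N_LINE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<cls>[0-9a-f]{4}):\s+"
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})",
    re.IGNORECASE,
)

def parse_lspci_n(output: str) -> dict:
    """Map each PCI slot to its (vendor_id, device_id) as integers."""
    ids = {}
    for line in output.splitlines():
        m = LSPCI_N_LINE.match(line.strip())
        if m:
            ids[m.group("slot")] = (
                int(m.group("vendor"), 16),
                int(m.group("device"), 16),
            )
    return ids

# Sample input copied from the output posted in this thread.
sample = """01:00.0 0300: 10de:1c03 (rev a1)
01:00.1 0403: 10de:10f1 (rev a1)"""
ids = parse_lspci_n(sample)
print(hex(ids["01:00.0"][1]))
```

The device ID of the VGA function (slot 01:00.0) is the value to compare against the IDs in the script.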
Hmm, did you add the entire thing?
console.log("Actual devid is " + actual_devid")
// GP102
I meant that you should only add the first thing, but before the GP102 comment. That would be my only explanation for the syntax error. It still doesn't make sense that yours isn't working though, since my lspci -n -s shows the same thing...
Well, since it is a script, the stray " at the end broke the whole string.
Here's the output:
actual_devid = 7171
spoofed_devid = 7171
actual_subsysid = 34230
spoofed_subsysid = 34230
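The values above are printed in decimal, while `lspci -n` shows hexadecimal. A small Python sketch of the conversion, using only the numbers reported here (the helper name `as_pci_hex` is made up for illustration):

```python
# Values printed by the script, as reported above (decimal).
reported = {
    "actual_devid": 7171,
    "spoofed_devid": 7171,
    "actual_subsysid": 34230,
    "spoofed_subsysid": 34230,
}

def as_pci_hex(value: int) -> str:
    """Format an ID as the four-digit lowercase hex used by `lspci -n`."""
    return f"{value:04x}"

for name, value in reported.items():
    print(f"{name} = {value} = 0x{as_pci_hex(value)}")
```

Here 7171 comes out as 0x1c03, which matches the device ID from lspci. The spoofed values in this output are identical to the actual ones.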
From the logs, it sounds like your DKMS module is the issue. It would tell you that you have an unsupported card if the Python script were wrong, but the issue you're getting is an unlock hook thing.
For me it's working perfectly at this point, so I'm not too sure where you went wrong, but it always helps to just start over from scratch (that's what I did, went from openSUSE → Proxmox). Also make sure you're using the vgpu-kvm driver (not the grid one) since I know that's a mistake I made.
Also, might be unrelated, but did you enable IOMMU? You have to do that for it to work IIUC.
Yes, IOMMU is enabled. Currently I'm using PCI passthrough (of course I disabled that before I attempted the vGPU driver).
I'm not sure about the driver, to be honest, since my registration at NVIDIA is not getting through. I'm using wild stuff I found on Google. Maybe someone can share the package?
Your driver is likely outdated or something. What I found on Google wasn't working either (and it was for XenServer, not a generic installer). For me the registration at NVIDIA here took about 2 minutes. I would recommend trying on a different email address (I used Protonmail and that worked just fine).
Well, I used Protonmail as well. Will give it another shot.
This is actually not an error. The GTX 1060 6GB that has PCI device ID 1C03 contains the GP106 chip. The GP106 does not appear on any GPU supported by vGPU. I have therefore assumed that it is not possible to use those 1060s, and that PCI device ID does not appear in the code.
If any of you have been able to get this working by spoofing it as a Tesla P4 (GP104), then my assumption was wrong and vGPU might be a bit more flexible than I thought.
Interesting. Yes, it definitely works with vGPU as I am running two VMs right now and am passing my GPU to both of them. I wonder if, in that case, support for other non-supported GPUs might be possible (eg. as I referenced in another thread, the GTX 780 spoofed as something like the GRID K2)?
I may test this later, since I have two GTX 780s.
@DualCoder GP106 confirmed working as P4, we are adding loads of PCI IDs including that for GP106 cards to vgpu_unlock.
What is nvidia-smi reporting for you? For me it's still showing the 1060, but I'm not sure if that's intended.
I've also added the IDs now, but on Proxmox I get the following:
[nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured
Is it UUID related?
Hi,
I have a very similar error with a 1080ti after adding args: -uuid 00000000-0000-0000-0000-000000000100.
VM start errors out:
[ 2.813886] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 2.814266] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4.950393] audit: type=1400 audit(1618303969.884:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=608 comm="apparmor_parser"
[ 4.950395] audit: type=1400 audit(1618303969.884:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=608 comm="apparmor_parser"
[ 5.639081] nvidia 0000:01:00.0: MDEV: Registered
[ 45.753371] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x1
If I don't include the args: -uuid 00000000-0000-0000-0000-000000000100, my error is:
[ 2.752442] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 2.752829] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4.886399] audit: type=1400 audit(1618303279.819:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=589 comm="apparmor_parser"
[ 4.886401] audit: type=1400 audit(1618303279.819:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=589 comm="apparmor_parser"
[ 5.580480] nvidia 0000:01:00.0: MDEV: Registered
[ 169.065653] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured
These IDs were added a while ago, closing. However, I might also add that GP106 works with P40 as well, which may provide some additional profiles.
@darabontors you may get some mileage trying Environment="__RM_NO_VERSION_CHECK=1" before the ExecStart in both of the systemd files for the vgpu-mgr and vgpud. For more support, join the Discord server.
Hi everyone,
Thanks very much for this work. I've been wanting to try out vGPUs for a very, very long time, and this might make my dreams come true, so it's very exciting.
I attempted to follow the instructions and nvidia-vgpud said I had an unsupported vGPU (I have a GTX 1060 6GB, which should be supported, right?).
I added this to the vgpu_unlock script, which made nvidia-vgpud "work" (as in, it exits with an error code of zero).
Here are the systemd logs for what I mean by nvidia-vgpud exiting:
I'm not certain this is what's supposed to happen (shouldn't it keep running?)
I went and created an mdev, following the instructions here.
When I added the mdev to libvirt, I used the following XML
I get the following error when starting the VM, though:
Dmesg says:
Did I do something wrong? Should I be using CentOS/RHEL instead of openSUSE?
I then found out that the systemd service nvidia-vgpu-mgr is a thing. These were the logs:
I set ExecStart to /opt/vgpu_unlock/vgpu_unlock /usr/bin/nvidia-vgpu-mgr (in hopes that would help), and now I have:
Is there something I'm missing, or is my setup just wrong/not supported, or did I mess up something, or… is this a bug that my GPU doesn't work?