DualCoder / vgpu_unlock

Unlock vGPU functionality for consumer grade GPUs.
MIT License

Tesla M40 Problems & Memory Allocation Limit with Tesla M40 24GB -> Tesla M60 remapping #62

Open BlaringIce opened 3 years ago

BlaringIce commented 3 years ago

First and primary: I'm coming from a setup where I was using a GTX 1060 with vgpu_unlock just fine, but I figured I'd step it up so that I could support more VMs, so I'm currently trying to use a Tesla M40. Being a Tesla card, you might expect it not to need vgpu_unlock, but this is one of the few Teslas that doesn't support vGPU natively. So I'm trying to use the nvidia-18 type from the M60 profiles with my VMs. I'm aware that I should be using a slightly older guest driver to match my host driver. However, I'm still getting a code 43 when I load my guest. I would provide some logs here, but I'm not sure what I can include, since the entries for the two vgpu services both seem to be fine, with no errors other than nvidia-vgpu-mgr[2588]: notice: vmiop_log: display_init inst: 0 successful at the end of trying to initialize the mdev device when the VM starts up. Please let me know any other information that I can provide to help debug/troubleshoot.

Second: This is probably one of the few instances where this is a problem, since most GeForce/Quadro cards have less memory than their vGPU-capable counterparts. However, I have a Tesla M40 GPU that has 24 GB of vRAM (in two separate memory regions, I would guess, although this SKU isn't listed on the Nvidia graphics processing units Wikipedia page, so I'm not 100% sure). This is in comparison to the Tesla M60's 2x8GB configuration, of which only 8GB is available for allocation in vGPU. I'm not sure whether the max_instance quantity, as seen in mdevctl types, is defined on the Nvidia driver side, on the vgpu_unlock side, or if it's a mix and the vgpu_unlock side might be able to do something about it. What I'm asking here, though, is whether this value can be redefined so that I can utilize all 24 GB of my available vRAM or, if not that, then at least the 12 GB that I presume is available in the GPU's primary memory.
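For reference (regarding the first issue), the way I'm creating the instances is the usual mdev flow, roughly like this (a sketch; nvidia-18 is the M60-2Q type reported by mdevctl types on my host, the PCI address is my card's, and the UUID is arbitrary):

# list the vGPU types the driver exposes once vgpu_unlock is active (M60 profiles here)
mdevctl types

# define and start an M60-2Q (nvidia-18) instance on the GPU at 0000:21:00.0
uuid=$(uuidgen)
mdevctl define -u "$uuid" -p 0000:21:00.0 --type nvidia-18
mdevctl start -u "$uuid"

The resulting mdev device is what the VM gets via vfio-mdev.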

DualCoder commented 3 years ago

However, I'm still getting a code 43 when I load my guest.

Code 43 is sort of Nvidia's catch-all error; it doesn't really provide any useful information. I think you have two options:

  1. Create a new VM with a clean config/bios/disk and reinstall Windows and the matching drivers from scratch.
  2. Test with a Linux guest. The Linux drivers tend to provide human-readable error messages. If needed, you can create the file /etc/modprobe.d/nvidia.conf with the line options nvidia NVreg_ResmanDebugLevel=0 to enable verbose output from the driver, as in the snippet below (this works on both Linux hosts and guests).
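For clarity, that file would contain just this single line (the comment is optional; modprobe.d files allow # comments):

# /etc/modprobe.d/nvidia.conf - make the NVIDIA driver log verbosely
options nvidia NVreg_ResmanDebugLevel=0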

What I'm asking here, though, is whether this value can be redefined so that I can utilize all 24 GB of my available vRAM or, if not that, then at least the 12 GB that I presume is available in the GPU's primary memory.

It seems that the M60 is quite special. If you compare the specs here:

https://www.pny.eu/en/consumer/explore-all-products/legacy-products/602-tesla-m60-r2l
https://www.pny.eu/en/consumer/explore-all-products/legacy-products/696-tesla-m40-24gb

you can see that the M60 explicitly states "16 GB GDDR5 (8 GB per board)", so I would expect your M40 to be technically capable of using all 24GB. However, the profiles available are determined by the driver, and the current version of vgpu_unlock does not attempt to alter them in any way. If you do get the card working I will see if I can create a workaround; it would also be useful for utilizing the full 11GB of a 1080 Ti.

ualdayan commented 3 years ago

I'm having issues with an M40 too. Dmesg wasn't returning anything, but eventually I figured out I needed to go into hooks.c and turn on logging. Oddly, though, I still don't see any of the syslog output from the main script file anywhere in the logs, but now I do at least see "vGPU unlock patch applied. Remap called."

I also saw 'nvidia-vgpu-mgr[4819]: op_type: 0xa0810115 failed.'

Still error 43 in Windows with 443.18 drivers; in Linux it says 'probe of 0000:01:00.0 failed with error -1'.

Also, I just tried passing it right through without modifying IDs, then installing the drivers that were bundled with the Linux vGPU drivers; it recognized it as an Nvidia GRID M60-2Q, but it still failed with code 43.

BlaringIce commented 3 years ago

Well, I've made the decision to go ahead and return the card while I'm still inside the return window. I'll likely still have the card for a day or two if there's anything specific I can try.

As for what I found since the initial post: I made a Linux guest, which I'm admittedly not as familiar with running Nvidia drivers on, as I've only used Linux with Nvidia-accelerated graphics on an older machine with a GTX 650. I first tried to install on Xubuntu 20.04, but I couldn't figure out how their built-in store's drivers worked and I got warnings about using the store instead when I tried to install manually. So, after that I tried switching over to Rocky Linux 8 (closer to the environment I'm familiar with from the GTX 650). With the latter driver, the output wasn't as verbose and pretty much just said that it couldn't load the 'nvidia-drm' kernel module during the install process, then quit. The older driver that I tried gave a little more info in its dkms make log:

/var/lib/dkms/nvidia/450.66/build/nvidia/nv-pci.c: In function 'nv_pci_probe':
/var/lib/dkms/nvidia/450.66/build/nvidia/nv-pci.c:427:5: error: implicit declaration of function 'vga_tryget'; did you mean 'vga_get'? [-Werror=implicit-function-declaration]
     vga_tryget(VGA_DEFAULT_DEVICE, VGA_RSRC_LEGACY_MASK);
     ^~~~~~~~~~
     vga_get
cc1: some warnings being treated as errors

But that doesn't tell me much either, other than that maybe the (probably older) kernel and/or gcc version in Rocky Linux may not be happy compiling the driver code. I can try any ideas that anyone else has during the time that I still have the card.

ualdayan commented 3 years ago

When you start a guest does nvidia-smi report anything under processes for you? For me on the M40 it always returns 'No running processes found'.

DualCoder commented 3 years ago

Still error 43 in Windows with 443.18 drivers; in Linux it says 'probe of 0000:01:00.0 failed with error -1'.

Have you tried enabling verbose logging using options nvidia NVreg_ResmanDebugLevel=0 (see my previous comment) and did that provide any more information?

When you start a guest does nvidia-smi report anything under processes for you? For me on the M40 it always returns 'No running processes found'.

It is supposed to list a vgpu process for each running VM, but if the driver fails to load in the VM then it is probably not listed, so you should focus on getting the guest driver to work.

I also saw 'nvidia-vgpu-mgr[4819]: op_type: 0xa0810115 failed.'

These 'op_type: 0xNNNNN failed.' messages can be ignored unless they are immediately followed by a more serious-looking error.

ualdayan commented 3 years ago

I enabled verbose logging and here are the log entries:

This, on repeat:

Jul 24 14:47:05 pop-os kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
Jul 24 14:47:05 pop-os kernel:
Jul 24 14:47:05 pop-os kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
Jul 24 14:47:05 pop-os kernel: NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:17f0) NVRM: installed in this system is not supported by the NVRM: NVIDIA 465.31 driver release. NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products' NVRM: in this release's README, available on the operating system NVRM: specific graphics driver download page at www.nvidia.com.
Jul 24 14:47:05 pop-os kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Jul 24 14:47:05 pop-os kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Jul 24 14:47:05 pop-os kernel: NVRM: None of the NVIDIA devices were initialized.
Jul 24 14:47:05 pop-os kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 236
Jul 24 14:47:05 pop-os systemd-udevd[562]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:47:06 pop-os kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236

And then this:

Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jul 24 14:53:51 pop-os systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jul 24 14:53:51 pop-os systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jul 24 14:53:51 pop-os systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jul 24 14:53:51 pop-os systemd[1]: Failed to start NVIDIA Persistence Daemon.

This is with passing through the devid of a Quadro M6000, and using drivers that claim to support Quadro M6000s.

Random thought: If you pass the M40 directly through to a virtual machine, at first it doesn't work because it's in some kind of compute-only mode, but then after you change the driver mode (nvidia-smi -g 0 -dm 0) it starts to function more like a regular GPU. It doesn't seem to be a persistent thing - e.g. it's saved somewhere in the registry of the Windows VM rather than somewhere on the card itself. In Linux, nvidia-smi tells you the mode can't be changed. What if it's stuck in some kind of compute mode in Linux (but not in Windows), and that's why it isn't enabling vGPU, since compute mode has to be off for the other Tesla cards before vGPU can be enabled?

DualCoder commented 3 years ago

Jul 24 14:47:05 pop-os kernel: NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:17f0) NVRM: installed in this system is not supported by the NVRM: NVIDIA 465.31 driver release.

Two problems here:

  1. The PCI ID 10DE:17F0 is for the Quadro M6000. This should be one of the M60-<digit><letter> vGPU profiles.
  2. The driver 465.31 is not listed as a vGPU driver here: https://docs.nvidia.com/grid/index.html

So please try again without any PCI spoofing tricks in the qemu configuration and use an officially supported driver version.

Random thought: If you pass the M40 directly through to a virtual machine, at first it doesn't work because it's in some kind of compute-only mode, but then after you change the driver mode (nvidia-smi -g 0 -dm 0) it starts to function more like a regular GPU. It doesn't seem to be a persistent thing - e.g. it's saved somewhere in the registry of the Windows VM rather than somewhere on the card itself. In Linux, nvidia-smi tells you the mode can't be changed. What if it's stuck in some kind of compute mode in Linux (but not in Windows), and that's why it isn't enabling vGPU, since compute mode has to be off for the other Tesla cards before vGPU can be enabled?

There might be something to this. Nvidia provides the gpumodeswitch tool to change the Tesla M60 and M6 cards between compute and graphics mode: https://docs.nvidia.com/grid/12.0/grid-gpumodeswitch-user-guide/index.html

As far as I can tell this is a persistent change and the card will store the active mode in on-board non-volatile memory. Maybe nvidia-smi -a -i 0 can be used to read out the mode?
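Something along these lines might show it, assuming the card exposes the mode through nvidia-smi at all (-a is just the old alias for -q):

# query everything for GPU 0 and look for the operation-mode section, if present
nvidia-smi -q -i 0 | grep -A 2 "GPU Operation Mode"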

BlaringIce commented 3 years ago

Well, I wasn't able to use nvidia-smi to tell, but I did try the gpumodeswitch tools. Sure enough, the card is in compute mode:

Tesla M40            (10DE,17FD,10DE,1173) H:--:NRM S:00,B:21,PCI,D:00,F:00
Adapter: Tesla M40   (10DE,17FD,10DE,1173) H:--:NRM S:00,B:21,PCI,D:00,F:00

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
InfoROM Version : G600.0200.02.02

Tesla M40            (10DE,17FD,10DE,1173) --:NRM 84.00.56.00.03
InfoROM Version : G600.0200.02.02
GPU Mode : Compute

From there I was able to use ./gpumodeswitch --gpumode graphics --auto and, a quick reboot later, I was in Graphics mode (identical output to the above, but with "Compute" replaced by "Graphics"). Note: I had to use dkms to temporarily uninstall my host driver during this process, since gpumodeswitch did not like it running at the same time.

Unfortunately, after this I reinstalled the driver but I'm still getting a code 43 in Windows, and in Linux I'm still having trouble even installing the driver. I did finally realize that I need to blacklist nouveau, but I'm still getting errors. Just running the installer normally, I get an error about the DRM-KMS module not being built correctly. I'm not sure if excluding that from the compilation would be a problem, but I gave it a shot with the --no-drm flag. That didn't do much better though:

ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA GPU(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

I have a couple ideas on some more things I can try. I'll report back if I have any more positive changes.

BlaringIce commented 3 years ago

Well, I've tried my couple other ideas. Unfortunately, I had no luck with any of them either. I upgraded to host driver version 460.73.02. Tried the Windows guest from there with 462.31, with no luck. Moved on to Linux from there. I did finally get the driver to install, technically (version 460.73.01 this time), but I did still need to use the --no-drm flag to do it. Now that it's installed, nvidia-smi still gives me the error where it can't communicate with the driver. And... I'm not really sure if the output is that helpful, since it doesn't look significantly different from @ualdayan 's output, but here are the results from running dmesg | grep -i nvidia:

[ 5.737475] nvidia: loading out-of-tree module taints kernel.
[ 5.737487] nvidia: module license 'NVIDIA' taints kernel.
[ 5.751686] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 5.763114] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 5.764336] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 5.764496] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e) NVRM: NVIDIA 460.73.01 driver release. NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products' NVRM: specific graphics driver download page at www.nvidia.com.
[ 5.765856] nvidia: probe of 0000:01:00.0 failed with error -1
[ 5.765880] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 5.765881] NVRM: None of the NVIDIA devices were initialized.
[ 5.766848] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241
[ 29.386194] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 29.387996] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 29.388154] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e) NVRM: NVIDIA 460.73.01 driver release. NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products' NVRM: specific graphics driver download page at www.nvidia.com.
[ 29.388841] nvidia: probe of 0000:01:00.0 failed with error -1
[ 29.388881] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 29.388881] NVRM: None of the NVIDIA devices were initialized.
[ 29.389445] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241
[ 30.584459] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 30.586215] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 30.586376] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e) NVRM: NVIDIA 460.73.01 driver release. NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products' NVRM: specific graphics driver download page at www.nvidia.com.
[ 30.586991] nvidia: probe of 0000:01:00.0 failed with error -1
[ 30.587041] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 30.587042] NVRM: None of the NVIDIA devices were initialized.
[ 30.587214] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241
[ 33.497634] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 33.504624] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 33.504782] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e) NVRM: NVIDIA 460.73.01 driver release. NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products' NVRM: specific graphics driver download page at www.nvidia.com.
[ 33.505502] nvidia: probe of 0000:01:00.0 failed with error -1
[ 33.505535] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 33.505536] NVRM: None of the NVIDIA devices were initialized.
[ 33.509144] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241

I'm not really sure how assigning the PCI device ID works when you're normally passing a device through with vGPU, but I tried looking up the GRID M60-2Q profile that I'm using and found a result that said it should be 114e, so that's what I tried. Hopefully that's right.

Anyways, please let me know if there's anything else I can try out.

BlaringIce commented 3 years ago

Oh, forgot to mention that on the host side I keep getting messages that say:

[nvidia-vgpu-vfio] [[DEVICE UUID HERE]]: vGPU migration disabled

I'm not sure if that really means anything important in this situation though. I'm assuming it means migrating to another GPU, and since I don't have one, I guess that makes sense.

ualdayan commented 3 years ago

My Tesla M40 doesn't seem to be compatible with gpumodeswitch like yours. For me it says:

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
NOTE: Preserving straps from original image.
Command id:1000000E Command: NV_UCODE_CMD_COMMAND_VV failed
Command Status:NV_UCODE_CMD_STS_NEW Error: NV_UCODE_ERR_CODE_CMD_VBIOS_VERIFY_BIOS_SIG_FAIL

Command id:000E Command: NV_UCODE_CMD_COMMAND_VV failed Command Status:NV_UCODE_CMD_STS_NONE Error: NV_UCODE_ERR_CODE_CMD_VBIOS_VERIFY_BIOS_SIG_FAIL

BCRT Error: Certificate 2.0 verification failed

ERROR: BIOS Cert 2.0 Verifications Error, Update aborted.

DualCoder commented 3 years ago

[nvidia-vgpu-vfio] [[DEVICE UUID HERE]]: vGPU migration disabled

This is expected since Qemu/KVM does not support the migration feature of the vGPU drivers.

[ 5.764496] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e) NVRM: NVIDIA 460.73.01 driver release.

Ok, this is an improvement: the driver 460.73.01 is supported, but the PCI ID 10de:114e is weird; there is no NVIDIA device with that ID. Can you provide the output of lspci -vvnn for both the host and the guest?

I'm not really sure how assigning the PCI device ID works when you're normally passing a device through with vGPU, but I tried looking up the GRID M60-2Q profile that I'm using and found a result that said it should be 114e, so that's what I tried. Hopefully that right.

Are you assigning it manually? Why? If you insist on setting it yourself, it should be:

Vendor ID: 0x10DE (NVIDIA)
Device ID: 0x13F2 (Tesla M60)
Subsystem Vendor ID: 0x10DE (NVIDIA)
Subsystem Device ID: 0x114E (M60-2Q)
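For completeness, a sketch of what those four values would look like as qemu -set overrides (hostdev0 is just an example device id; normally you should not need to pass any of these):

-set device.hostdev0.x-pci-vendor-id=0x10de
-set device.hostdev0.x-pci-device-id=0x13f2
-set device.hostdev0.x-pci-sub-vendor-id=0x10de
-set device.hostdev0.x-pci-sub-device-id=0x114e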

BlaringIce commented 3 years ago

Are you assigning it manually? Why? If you insist on setting it yourself, it should be

I am assigning it manually, but really only out of ignorance of the 'normal' way you would do it. I'll change the parameters to match what you have there, at least.

Can you provide the output of lspci -vvnn for both the host and guest?

Sure, see the attached files. hostlspcivvnn.txt guestlspcivvnn.txt

BlaringIce commented 3 years ago

I'm wondering if there's a possibility here that the GM200 chipset is just built, up and down the stack, not to support vGPU at a hardware level. Since it's marketed as incompatible on the only Tesla card that uses that silicon (the M40), and the other cards (980 Ti, Maxwell Titan X, and Quadro M6000) wouldn't be expected to have it work anyway, maybe it's just totally locked out? I'm not sure if Nvidia would really go to such lengths to design it that way, since it would deviate from their lower-tier designs. Do you know of any confirmed cases of someone getting vGPU working on a 980 Ti, Titan X, or M6K?

DualCoder commented 3 years ago

Can you provide the output of lspci -vvnn for both the host and guest?

Sure, see the attached files.

These look correct: the device shows up as a VGA controller with a 256 MB BAR1, so it is in the correct graphics mode. And the device in the guest shows up with the correct PCI IDs.

I am assigning manually, but only really out of ignorance for the 'normal' way you would do it.

I'm guessing that you are setting it either using the qemu command line with an argument like -set device.hostdev0.x-pci-vendor-id=NNNN or in a libvirt xml file with something like:

<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev0.x-pci-vendor-id=NNNN'/>

so the "normal" way is to not pass those arguments/xml elements (i.e remove them).

I'm wondering if there's a possibility here that the GM200 chipset is just built, up and down the stack, not to support vGPU at a hardware level. Since it's marketed as incompatible on the only Tesla card that uses that silicon (the M40), and the other cards (980 Ti, Maxwell Titan X, and Quadro M6000) wouldn't be expected to have it work anyway, maybe it's just totally locked out? I'm not sure if Nvidia would really go to such lengths to design it that way, since it would deviate from their lower-tier designs.

There is a possibility that there exists some technical limitation that prevents this from working, yes. But vGPU is a software solution and doesn't rely on the existence of some special hardware feature to function; however, if the hardware is special in its design (like the GTX 970's 3.5+0.5 GB memory layout) it might be incompatible.

Do you know of any confirmed cases of someone getting vGPU working on a 980 Ti, Titan X, or M6K?

I do not.

I'll change the parameters to match what you have there, at least.

Now it looks correct, does the driver still complain about the device being unsupported?

BlaringIce commented 3 years ago

Well, specifically nvidia-smi says:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I'm not sure how to check the driver itself. I'm not sure what the service name is for the normal Linux guest drivers; a quick Google didn't really give me anything, nor did tab completion give me anything with systemctl status nvi[tab here]. Otherwise I would try to check the driver itself instead of just SMI. I tried reinstalling the driver, too, once I reset the IDs to what you'd said, and it still complained about DRM-KMS, so I did have to use the --no-drm flag to install.

DualCoder commented 3 years ago

It should install without the --no-drm flag now that the IDs are correct. For nvidia-smi not working, I would check for errors in dmesg on both host and guest, and journalctl -u nvidia-vgpu-mgr on the host. Otherwise /var/log/Xorg.0.log might give some info if X fails to start, but I don't think it will even try if you installed with --no-drm.
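Collected as commands (dmesg applies to both host and guest, the other two where noted):

# host and guest: kernel messages from the NVIDIA modules
dmesg | grep -i -e nvidia -e nvrm

# host: the vGPU manager log
journalctl -u nvidia-vgpu-mgr

# guest: the X server log, if X even tried to start
cat /var/log/Xorg.0.log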

What error does the installer give that prevents it from installing without --no-drm?

DualCoder commented 3 years ago

When looking around I noticed that there are two 460.73.01 drivers; you want the -grid version, check the checksum:

sha256sum NVIDIA-Linux-x86_64-460.73.01*
d10eda9780538f9c7a222aa221405f51cb31e2b7d696b2c98b751cc0fd6e037d  NVIDIA-Linux-x86_64-460.73.01-grid.run
11b1c918de26799e9ee3dc5db13d8630922b6aa602b9af3fbbd11a9a8aab1e88  NVIDIA-Linux-x86_64-460.73.01.run

I also found that Google publishes the files here https://cloud.google.com/compute/docs/gpus/grid-drivers-table

The non-grid version explicitly lists the 24GB M40 as supported, so I do not understand why it refuses to work.

BlaringIce commented 3 years ago

Well, I was able to load the GRID version of the driver without any errors during install. Having done so, nvidia-smi now gives the very uninteresting output of:

No devices were found

With the GRID driver installed I get an output for dmesg | grep -i nvidia of:

[ 2.743559] nvidia: loading out-of-tree module taints kernel.
[ 2.743569] nvidia: module license 'NVIDIA' taints kernel.
[ 2.756062] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.766318] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 2.767463] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 2.768253] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.73.01 Thu Apr 1 21:40:36 UTC 2021
[ 2.813590] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 460.73.01 Thu Apr 1 21:32:31 UTC 2021
[ 2.817926] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 2.817930] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[ 6.833690] NVRM: nvidia_open...
[ 6.833696] NVRM: nvidia_ctl_open
[ 6.835930] NVRM: nvidia_open...
[ 6.900405] NVRM: nvidia_open...
[ 6.949965] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 6.949966] NVRM: nvidia_ctl_close
[ 6.950939] NVRM: nvidia_open...
[ 6.950942] NVRM: nvidia_ctl_open
[ 6.951226] NVRM: nvidia_open...
[ 7.006033] NVRM: nvidia_open...
[ 7.059159] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 7.059160] NVRM: nvidia_ctl_close
[ 9.538961] NVRM: nvidia_open...
[ 9.538965] NVRM: nvidia_ctl_open
[ 9.542569] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 9.542571] NVRM: nvidia_ctl_close
[ 10.190619] NVRM: nvidia_open...
[ 10.190622] NVRM: nvidia_ctl_open
[ 10.346947] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 10.346949] NVRM: nvidia_ctl_close
[ 10.817518] NVRM: nvidia_open...
[ 10.817523] NVRM: nvidia_ctl_open
[ 11.028236] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 11.028239] NVRM: nvidia_ctl_close
[ 11.327625] NVRM: nvidia_open...
[ 11.327629] NVRM: nvidia_ctl_open
[ 11.390885] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 11.390887] NVRM: nvidia_ctl_close
[ 11.932444] NVRM: nvidia_open...
[ 11.932448] NVRM: nvidia_ctl_open
[ 12.004938] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 12.004939] NVRM: nvidia_ctl_close
[ 12.394800] NVRM: nvidia_open...
[ 12.394807] NVRM: nvidia_ctl_open
[ 12.504682] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 12.504684] NVRM: nvidia_ctl_close
[ 15.071319] NVRM: nvidia_open...
[ 15.071326] NVRM: nvidia_ctl_open
[ 15.104593] NVRM: nvidia_open...
[ 15.142268] NVRM: nvidia_open...
[ 61.436038] NVRM: nvidia_open...
[ 61.436042] NVRM: nvidia_ctl_open
[ 61.436364] NVRM: nvidia_open...
[ 61.476481] NVRM: nvidia_open...
[ 61.515963] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 61.515964] NVRM: nvidia_ctl_close

BlaringIce commented 3 years ago

The same query on the host gives... this: hostdmesg.log

DualCoder commented 3 years ago

It looks like it tries to load now (then fails, then tries again, ...), but I can't see any error being printed. Can you provide the log without the grep -i nvidia filter? Also, make sure that the nouveau driver is properly blacklisted in the guest.
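The usual way to do that (a sketch; the exact initramfs command depends on the distro) is to create /etc/modprobe.d/blacklist-nouveau.conf in the guest containing:

# keep nouveau from grabbing the GPU before the NVIDIA driver loads
blacklist nouveau
options nouveau modeset=0

and then rebuild the initramfs (update-initramfs -u on Debian/Ubuntu, dracut --force on RHEL/Rocky) and reboot the guest.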

BlaringIce commented 3 years ago

Unfortunately this will probably be my last post regarding the M40 - maybe someone else can pick this up in the future, but I've got to return mine now and I've got an M60 in that I can try out instead. Thank you for all the help with this though!

Here are the full guest and host dmesg logs in case they reveal something useful: guestdmesgfull.log

BlaringIce commented 3 years ago

hostdmesgfull.log

haywoodspartan commented 3 years ago

I was able to get the vGPU to split with the Tesla M40 24GB on a Proxmox host with the vgpu_unlock script. If needed I had to use a hacky way of doing it with a spoof on the vgpu itself. However I am limited to only doing 1 vgpu on this card at any given time for some reason. I wonder if there was a way to give it more availability since I have the VRAM to do it and I have ECC disabled on the card. For testing purposes I have this on my home server behind a load balancer. If the dev wants to mess around with it, he can, if he needs a working debug environment.

DualCoder commented 3 years ago

If needed I had to use a hacky way of doing it with a spoof on the vgpu itself.

That's interesting, did you see the same issues as reported previously in this issue? Do you mind sharing details on this "hacky way"?

However I am limited to only doing 1 vgpu on this card at any given time for some reason. I wonder if there was a way to give it more availability since I have the VRAM to do it and I have ECC disabled on the card.

I assume this means you were able to create a single vGPU instance, assign it to a VM, and then load the drivers inside the VM to get hardware acceleration. This would be good news for the Tesla M40. In order to use multiple instances at the same time you should check the following:

If you can provide error messages or log files that would be helpful too.

FallingSnow commented 2 years ago

@BlaringIce How did you get your M40 into graphics mode? Mine won't seem to switch. I've even restarted a few times.

cl1# ./gpumodeswitch --gpumode graphics --auto

NVIDIA GPU Mode Switch Utility Version 1.23.0
Copyright (C) 2015, NVIDIA Corporation. All Rights Reserved.

Tesla M40            (10DE,17FD,10DE,1171) H:--:NRM S:00,B:03,PCI,D:00,F:00
Adapter: Tesla M40            (10DE,17FD,10DE,1171) H:--:NRM S:00,B:03,PCI,D:00,F:00

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page

Programming UPR setting for requested mode..
License image updated successfully.

Programming ECC setting for requested mode..
The display may go *BLANK* on and off for up to 10 seconds or more during the update process depending on your display adapter and output device.

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
NOTE: Preserving straps from original image.
Clearing original firmware image...
Storing updated firmware image...
.................
Verifying update...
Update successful.

Firmware image has been updated from version 84.00.48.00.01 to 84.00.48.00.01.

A reboot is required for the update to take effect.

InfoROM image updated successfully.

cl1# ./gpumodeswitch --version                

NVIDIA GPU Mode Switch Utility Version 1.23.0
Copyright (C) 2015, NVIDIA Corporation. All Rights Reserved.

Tesla M40            (10DE,17FD,10DE,1171) H:--:NRM S:00,B:03,PCI,D:00,F:00
Adapter: Tesla M40            (10DE,17FD,10DE,1171) H:--:NRM S:00,B:03,PCI,D:00,F:00

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
InfoROM Version : G600.0202.02.01

Tesla M40        (10DE,17FD,10DE,1171) --:NRM 84.00.48.00.01
InfoROM Version  : G600.0202.02.01
GPU Mode         : Compute

haywoodspartan commented 2 years ago

I will have to get back to this project, as my workload as of late has required me to deploy OpenStack Xena on my homelab setup for work purposes. However, OpenStack does allow for mdev devices and Nvidia vGPU virtual machines on a KVM-type system.

FallingSnow commented 2 years ago

I figured it out. My vbios was out of date. lspci now shows a 256MB BAR.

$ lspci -v
03:00.0 VGA compatible controller: NVIDIA Corporation GM200GL [Tesla M40] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation GM200GL [Tesla M40]
    Flags: bus master, fast devsel, latency 0, IRQ 69, IOMMU group 0
    Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 7fe0000000 (64-bit, prefetchable) [size=256M]
    Memory at 7ff0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at f000 [size=128]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia

@haywoodspartan I'm using the 12GB variant. I was able to split the GPU and spoof an M6000 instance to my VM. However, I'm battling the dreaded code 43 right now.

FallingSnow commented 2 years ago

I ended up giving up on the M40; it kept unloading the guest driver. I put in a 1070 Ti and it worked perfectly.

TymanLS commented 2 years ago

In case this helps at all, I'm observing the same code 43 behavior when using a Tesla M40 12GB; however, I am using the Merged-Rust-Drivers which uses the Rust-based vGPU unlock. I'm not sure how much of this information applies specifically to this codebase, but hopefully it can provide some insight into the process of unlocking vGPU in general.

I am testing on a Proxmox 7.1 OS with the 5.11 kernel manually installed, since the kernel patches for the merged driver didn't work for the 5.13 kernel (and I was running into some other unrelated Matrox video card bugs with the 5.15 kernel). With a few tweaks here and there, I was able to get to where the text "vGPU unlock patch applied" shows up in the output of dmesg, mdevctl types showed the list of available vGPU types for the Tesla M60, and I was able to create a GRID M60-4Q vGPU instance and assign it to a VM. However, I am running into issues seemingly with the guest driver in a Windows VM, where it will either return a code 43 error or BSOD when trying to install/load the driver; if I recall correctly, the BSOD errors were pretty much always SYSTEM_SERVICE_EXCEPTION (nvlddmkm.sys).

The GRID guest driver (list of them mentioned here) gave a code 43 error when I tried it. Since the merged driver was based on the 460.73.01 Linux driver, I chose the 462.31 Windows GRID guest driver which corresponds to the same GRID version (12.2) according to NVIDIA's website. I also tried spoofing the vGPU's PCI ID within the VM by specifying the x-pci-vendor-id and x-pci-device-id parameters in the QEMU configuration file. I spoofed a Quadro M6000 like @FallingSnow, but the normal Quadro M6000 drivers would also code 43. I tried multiple versions of the Quadro drivers including multiple 46x versions, a 47x version, and the latest version; none worked, and they all either gave a code 43 or BSOD. Additionally I tried to spoof a GTX 980, since I thought that card would be the closest to the GRID M60-4Q vGPU I was using; the GTX 980 used a GM204 GPU like the Tesla M60, and it came standard with 4GB of VRAM. Once again I got a code 43 error when trying to use the standard GeForce drivers for the GTX 980.

Another thing to note is that I have not made any changes to my VBIOS since getting the card. I did get it off eBay though, so I suppose anything is possible. I also did NOT attempt to set the GPU into graphics mode; my output of lspci -v shows a 16GB BAR instead of a 256MB one.

I am very interested in the configuration that @haywoodspartan described that allowed him to get vGPU working. From what I've researched so far (not much), I've only ever heard of two instances of a Tesla M40 being successfully used with vGPU: haywoodspartan's post in this issue thread, and Jeff from CraftComputing in this clip (though he also mentioned the 8GB VRAM limit). Notably, both of these instances were using the 24GB variant of the Tesla M40.

Let me know if there is any testing I can do to help assist with the project, I would absolutely love to get this Tesla M40 working in some remote gaming desktops!

FallingSnow commented 2 years ago

@TymanLS You need to apply the 5.14 patch to the Proxmox 5.13 kernel to get it to compile, and the same for the Proxmox 5.15 kernel and the 5.16 patch. And yes, I've tried these kernels too lol. I've also tried the Rust drivers, same issues.

What I learned was that non-vGPU drivers (Quadro/GeForce) will never work with vGPUs. I assume this is because the guest driver has to communicate with the host driver. Since you need to use the vGPU driver, only vGPUs are supported (no point in spoofing). 47x drivers do not work with this (this is confirmed in the Discord). I never got a blue screen, but code 43 always popped up.

TymanLS commented 2 years ago

@FallingSnow Thanks for the advice on patching the kernels, I may try that in some of my future testing. I thought the 5.11 kernel didn't require patches though, which is part of the reason I decided to downgrade back to that kernel version for testing. I figured it might have been easier to troubleshoot if there were fewer things I had to mess around with.

Also, regarding the use of non vGPU guest drivers inside the VMs, I believe Craft Computing was able to make that work in his vGPU tutorial. Granted, he was running on an even older version of Proxmox (6.4 with kernel 5.4, if I remember correctly), and he was using an RTX 2080 instead of a Tesla M40. However, he was able to spoof his vGPU profile as a Quadro RTX 6000 within the Windows VM, install the regular Quadro drivers, and have it work. I think it's also interesting that his RTX 2080 was able to spoof as a Quadro RTX 6000 on both the host and in the guest, since those cards don't use the same chip; the RTX 2080 uses the TU104 whereas the Quadro RTX 6000 uses the TU102. I was hoping that the situation would be similar here since the Tesla M40 uses the GM200 chip, which is different from the GM204 chip of the Tesla M60 that it pretends to be on the host. However, since the actual GRID Windows guest driver is also giving me a code 43 when I'm not spoofing the guest GPU, I'm guessing the problem might not be related to the VM guest driver.

I might try downgrading the host driver version to see if that makes a difference. I'm using the drivers for vGPU 12.1, which is vGPU driver 460.73.02, Linux guest driver 460.73.01, and Windows guest driver 462.31. I'm using those versions because that's what the Rust merged driver is using, so I figured it had a better chance of working. Craft Computing was using even older drivers; he was using the drivers for vGPU 11.1, which is vGPU driver version 450.80, Linux guest driver 450.80.02, and Windows guest driver 452.39.

FallingSnow commented 2 years ago

Only 5.13 and up require patches. It's just that they require patches for the next version up.

Hmm, that is interesting. I guess I stand corrected about the VGPU drivers then.

I hope you end up finding a solution. The M40 has turned out to be too much of a hassle for me.

jiangcuo commented 2 years ago

@TymanLS, I have the same problem as you: PVE 7 with kernel 5.15-30, M40 12GB, merged driver.

I noticed that the successful examples both used a Tesla M40 24GB and vGPU 14, e.g.:

https://blog.zematoxic.com/06/03/2022/Tesla-M40-vGPU-Proxmox-7-1/
https://www.youtube.com/watch?v=jTXPMcBqoi8

republicus commented 2 years ago

I have a Tesla M40 working well, using both this repo and vgpu_unlock_rs, with the 510.47.03 driver on the Proxmox host.

Like many people have done, my working config passes through to Windows guests an NVIDIA Quadro M6000. It works great. I do not experience any error 43 issues or problems with performance or drivers on Windows 10 or 11.

What brought me here was my attempt to get Linux guests to enjoy the same benefits.

After some tweaks, the only way I can get Linux guests working at all is to pass through a specific GRID device. An unchanged device with no PCI ID changes passes through as an M60 -- which would not work with any proprietary NVIDIA drivers.

After changing the PCI IDs, the Linux guest works great until the official driver goes into limp mode (triggered at 20 minutes of uptime; it slows the frequency and sets a 15 FPS cap). I observe the same behavior with the Windows driver going into limp mode when using the unlicensed official vGPU driver for Windows.

The PCI ID that works in Linux guests:

# PCI ID for GRID M60 0B
pci_id = 0x13F2114E
pci_device_id = 0x13F2
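For context, in a vgpu_unlock_rs profile_override.toml those two keys would presumably sit under the section for whichever profile the VM uses, something like this (the nvidia-18 section name is just an illustration, not taken from my actual config):

[profile.nvidia-18]
# spoof the guest-visible PCI IDs (values as above)
pci_id = 0x13F2114E
pci_device_id = 0x13F2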

It would appear that, unlike the Windows drivers, the Linux proprietary drivers for Quadro and Tesla/compute cards do not share the same instructions for vGPU capabilities. I have tried a series of different PCI IDs and drivers with no joy.

FallingSnow commented 2 years ago

@republicus You have an M40 24GB right? It's the M40 12GB that doesn't work.

dulasau commented 2 years ago

I have the same problem with the M40 12GB. I was able to pass it through to my guest Windows 11 VM (Proxmox 7.2), and with the Quadro M6000 guest drivers I was able to make it work and get an OK score in the Heaven benchmark. But every time I try to use it as a vGPU I'm getting a BSOD. I'm going to try to play with the 24GB version, probably next week.

dulasau commented 2 years ago

BTW, in order to get a video output, even through Parsec or TightVNC, I have to use my GTX 950 as an additional GPU. Any workarounds for this?

haywoodspartan commented 2 years ago

Have you set the GPU from Compute to Graphics mode? Apparently it may or may not persist after reboots, according to some people.

https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf

This would need to be done on the host and the individual virtual machines.

There is also the fact that you need to have a virtual display adapter installed in Windows. The Parsec one works fine in most cases.

TymanLS commented 2 years ago

@dulasau This guide may be helpful. The person in this video seems to be using the M40 in a physical machine instead of a VM, hence why they have to install their iGPU drivers. If you're planning to only connect remotely with Parsec, you shouldn't need to install any iGPU drivers.

dulasau commented 2 years ago

@TymanLS I'm using a Ryzen 5900X, so unfortunately no iGPU.

dulasau commented 2 years ago

@haywoodspartan Yeah, I did switch it to Graphics mode (it persists after reboots), although I've done this only on the guest machine. I don't even load the host Nvidia drivers, since I wasn't able to make vGPU work. Or do you mean I need to enable Graphics mode on the host to make vGPU work? (I think I tried that and it didn't help.) Parsec provides an optional virtual display for "headless" machines, but it didn't help either.

TymanLS commented 2 years ago

@dulasau If you're passing the M40 straight through to a VM (not using vGPU), then I don't think the host drivers matter, since the host system shouldn't be able to access the card. When you say you have to use the GTX 950, are you also passing that through to the VM, or are you leaving it connected to the host system? I remember successfully setting up a Windows 10 VM with Parsec connectivity while passing through only the M40 and no other GPUs, so I'm curious why it wouldn't work for you.

dulasau commented 2 years ago

@TymanLS I'm passing my GTX 950 directly through to the VM. It's interesting that even though I'm using spoofing (from QEMU) to claim that the card is a Quadro M6000, the Nvidia driver doesn't believe me: despite Windows Device Manager saying that the GPU is an M6000, the Nvidia driver (and games) say that it's an M40. Maybe the Nvidia driver, knowing that the M40 doesn't have a video output, blocks it somehow (the video output for Parsec, TightVNC, etc.)? One time the Nvidia driver (and RTX Experience) agreed that the card was an M6000, but that was probably an older driver (I should try this again), and I didn't test Parsec at that time. One good thing about using the GTX 950 is that I can use GeForce Experience; I haven't had luck yet with the Parsec client on a Raspberry Pi 4 :)

republicus commented 2 years ago

@FallingSnow Yes, you're right. I have a 12 GB version that the seller said was last flashed with a TITAN X vbios. I'll see what, if anything, the vbios might do and report back any lessons learned.

@dulasau I am seeing the same behavior on Linux guests. The driver seems to recognize that it is a vGPU even when spoofed.

dulasau commented 2 years ago

Hmm... ok, the same BSOD with the 24GB version. Something is wrong...

FallingSnow commented 2 years ago

Check your dmesg.

dulasau commented 2 years ago

Check your dmesg.

Am I looking for something specific?

FallingSnow commented 2 years ago

Any errors really about why vgpu might be failing.

dulasau commented 2 years ago

I don't see any errors related to vgpu