DualCoder / vgpu_unlock

Unlock vGPU functionality for consumer grade GPUs.
MIT License

Tesla M40 Problems & Memory Allocation Limit with Tesla M40 24GB -> Tesla M60 remapping #62

Open BlaringIce opened 3 years ago

BlaringIce commented 3 years ago

First and primary: I'm coming from a setup where I was using a GTX 1060 with vgpu_unlock just fine, but figured I'd step it up so that I could support more VMs. So I'm currently trying to use a Tesla M40. Being a Tesla card, you might expect not to need vgpu_unlock, but this is one of the few Teslas that doesn't support vGPU natively. So I'm trying to use nvidia-18 types from the M60 profiles with my VMs. I'm aware that I should be using a slightly older guest driver to match my host driver. However, I'm still getting a code 43 when I load my guest. I would provide some logs here, but I'm not sure what I can include, since the entries for the two vgpu services both seem fine, with no errors other than `nvidia-vgpu-mgr[2588]: notice: vmiop_log: display_init inst: 0 successful` at the end of initializing the mdev device when the VM starts up. Please let me know any other information I can provide to help debug/troubleshoot.

Second: This is probably one of the few instances where this is a problem, since most GeForce/Quadro cards have less memory than their vGPU-capable counterparts. However, I have a Tesla M40 GPU that has 24 GB of vRAM (in two separate memory regions, I would guess, although this SKU isn't listed on the Nvidia graphics processing units Wikipedia page, so I'm not 100% sure). This is in comparison to the Tesla M60's 2x8GB configuration, of which only 8 GB is available for allocation in vGPU. I'm not sure whether the max_instance quantity, as seen in `mdevctl types`, is defined on the Nvidia driver side, on the vgpu_unlock side, or if it's a mix and the vgpu_unlock side might be able to do something about it. What I'm asking, though, is whether this value can be redefined so that I can utilize all 24 GB of my available vRAM or, failing that, at least the 12 GB that I presume is available in the GPU's primary memory region.
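For reference when reading the profile overrides further down in this thread: vgpu_unlock-rs takes the `framebuffer` value in raw bytes. A minimal sketch (assuming a shell with 64-bit arithmetic; the helper name is mine, not part of vgpu_unlock) for converting a desired per-VM allocation to that byte value:

```shell
#!/bin/sh
# Convert a per-VM framebuffer size in GiB to the raw byte value used
# in profile_override.toml (hypothetical helper, not part of vgpu_unlock).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 11   # prints 11811160064, the value that appears later in this thread
```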

angst911 commented 2 years ago

I have a Tesla M40 working well, using both this repo and vgpu_unlock_rs. 510.47.03 driver on a Proxmox host.

Like many people have done, my working config passes through to Windows guests an NVIDIA Quadro M6000. It works great. I do not experience any error 43 issues or problems with performance or drivers on Windows 10 or 11.

What brought me here was my attempt to get Linux guests to enjoy the same benefits.

After some tweaks, the only way I can get Linux working at all is to pass through a specific GRID device. An unchanged device with no PCI ID changes is passed through as an M60, which would not work with any proprietary NVIDIA drivers.

After changing the PCI IDs, the Linux guest works great until the official driver goes into limp mode (triggered at 20 minutes of uptime; it lowers the clock frequency and sets a 15 FPS cap). I observe the same limp-mode behavior with the Windows driver when using the unlicensed official vGPU driver for Windows.

The PCI ID that works in Linux guests:

```
# PCI ID for GRID M60 0B
#pci_id = 0x13F2114E
#pci_device_id = 0x13F2
```

It would appear, unlike the Windows drivers, that the Linux proprietary drivers for Quadro and Tesla/Compute cards do not share the same instructions for vGPU capabilities. I have tried a series of different PCI IDs and drivers with no joy.

I'd love to know what steps/process you followed. I've been beating my head against the wall for 2 days now on this project. I've got two M40s that I'm trying to use as vGPUs (this mod plus -rs). Things "look" right, but I always get Error 43. I'm using the same driver version, and Proxmox 7.2.

Can you share your VM config also?

angst911 commented 2 years ago

Where did you get the patches for the kernel versions?

dulasau commented 2 years ago

> I'd love to know what steps/process you followed. I've been beating my head against the wall for 2 days now on this project. I've got two M40s that I'm trying to use as vGPUs (this mod plus -rs). Things "look" right, but I always get Error 43. I'm using the same driver version, and Proxmox 7.2.
>
> Can you share your VM config also?

+1

angst911 commented 2 years ago

Make sure Secure Boot is disabled in the UEFI BIOS. The story that got me to this:

I followed this guide originally, https://wvthoog.nl/proxmox-7-vgpu-v2/, using the pre-patched driver. Everything worked except Error 43. Then I swapped over to using the video guide from Craft Computing (https://www.youtube.com/watch?v=jTXPMcBqoi8&t=1626s).

I had all sorts of fun manually patching the 510 driver set for the 5.15 kernel, which maybe I didn't need to...

I just about gave up and decided to do a Debian VM. I disabled the custom profiles (by renaming the toml file at /etc/vgpu_profiles), stopped spoofing to a Quadro M6000, and installed the GRID driver in Debian. That got me errors about not being able to load the drm module, which led me to disabling Secure Boot. I did the same in Windows (after having to expand my partition)... and magic: working with the GRID driver. Turned my custom profiles back on, uninstalled the GRID driver, reinstalled the Quadro desktop drivers... now I'm at Error 31. So, progress?

angst911 commented 2 years ago

OK, now back to error 43 with the Quadro drivers, but this is still progress. I was getting error 43 with the GRID drivers previously also.

republicus commented 2 years ago

@dulasau @angst911

I just want to point out again that I have the 24GB version of the Tesla M40. Earlier others indicated the problem may be related to the 12GB version only.

I can give more details if this isn't enough to get you going. Let me know how it goes.

  • First I installed the vgpu_unlock script onto my Proxmox host.
  • Secondly, I like how vgpu_unlock-rs complements this repo, so I set up vgpu_unlock-rs on the Proxmox host as well.

Beyond that there are very few specific configurations needed for the VM.

Configuration changes to the VM config. Add the line:

```
args: -uuid 00000000-0000-0000-0000-000000000XXX
```

where XXX = VMID.
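Assuming the last UUID field is just the VMID zero-padded to 12 digits (as the XXX placeholder above suggests), the args line can be generated like this (VMID 104 is an example value):

```shell
#!/bin/sh
# Build the deterministic -uuid argument from a Proxmox VMID.
# VMID=104 is an example; substitute your own VM's ID.
VMID=104
printf -- '-uuid 00000000-0000-0000-0000-%012d\n' "$VMID"
# prints: -uuid 00000000-0000-0000-0000-000000000104
```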

Add your hardware to the VM in the GUI. I used MDev Type nvidia-12, or use whichever type reported by `mdevctl types` that has available instances.

I then made changes to the MDev Type by creating/editing /etc/vgpu_unlock/profile_override.toml:

```
[profile.nvidia-12]
num_displays = 1
display_width = 3840
display_height = 2160
max_pixels = 8294400
cuda_enabled = 1
frl_enabled = 144
framebuffer = 5905580032
pci_id = 0x17F011A0
pci_device_id = 0x17F0
```
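A quick consistency check on the override above: max_pixels appears to be the product of display_width and display_height (my inference from the values; the vgpu_unlock-rs docs are the authority here):

```shell
#!/bin/sh
# display_width * display_height should match max_pixels for the 4K profile.
echo $((3840 * 2160))   # prints 8294400
```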

This was enough to get my Tesla M40 vGPU profile working in Windows 10/11. The device is spoofed as a Quadro M6000, and I increased most of the MDev profile's limits to test its capabilities (I currently game in 4K daily with this working profile).


angst911 commented 2 years ago

> @dulasau @angst911
>
> I just want to point out again that I have the 24GB version of the Tesla M40. Earlier others indicated the problem may be related to the 12GB version only.
>
> I can give more details if this isn't enough to get you going. Let me know how it goes.
>
>   • First I installed the vgpu_unlock script onto my Proxmox host.
>   • Secondly, I like how vgpu_unlock-rs complements this repo, so I set up vgpu_unlock-rs on the Proxmox host as well.
>
> Beyond that there are very few specific configurations needed for the VM.
>
> Configuration changes to the VM config. Add the line `args: -uuid 00000000-0000-0000-0000-000000000XXX` where XXX = VMID.
>
> Add your hardware to the VM in the GUI. I used MDev Type nvidia-12, or use whichever type reported by `mdevctl types` that has available instances.
>
> I then made changes to the MDev Type by creating/editing /etc/vgpu_unlock/profile_override.toml:
>
> ```
> [profile.nvidia-12]
> num_displays = 1
> display_width = 3840
> display_height = 2160
> max_pixels = 8294400
> cuda_enabled = 1
> frl_enabled = 144
> framebuffer = 5905580032
> pci_id = 0x17F011A0
> pci_device_id = 0x17F0
> ```
>
> This was enough to get my Tesla M40 vGPU profile working in Windows 10/11. The device is spoofed as a Quadro M6000, and I increased most of the MDev profile's limits to test its capabilities (I currently game in 4K daily with this working profile).

@republicus What version of Proxmox, kernel, and NVIDIA driver are you on (both host and guest)? -- Note: I can see 512.78 in the screenshot for the guest. Can you provide a link to that download? I wasn't able to find it on NVIDIA's site.

Which VM machine type and BIOS/UEFI did you use?

Did you 100% follow the vgpu_unlock instructions, or did you follow the modified instructions for using it with vgpu_unlock-rs?

I'm at the point where the GRID driver works, but I get error 43 if I use the Quadro driver and spoof the device ID. Proxmox 7.2, kernel 5.15, vgpu_unlock + vgpu_unlock_rs (driver patched to include the SRC and kbuild config line prior to running the NVIDIA installer). Host driver: NVIDIA-Linux-x86_64-510.47.03-vgpu-kvm.run, manually integrating the kernel-related driver patches. Guest driver: 511.65_grid_win10_win11_server2016_server2019_server2022_64bit_international

Working GRID vgpu_profile.toml:

```
[profile.nvidia-18]
num_displays = 1
display_width = 1920
display_height = 1080
max_pixels = 2073600
cuda_enabled = 1
frl_enabled = 60
framebuffer = 5905580032
```

And the profile that doesn't work when spoofing to an M6000:

```
[profile.nvidia-18]
num_displays = 1
display_width = 1920
display_height = 1080
max_pixels = 2073600
cuda_enabled = 1
frl_enabled = 60
framebuffer = 5905580032
pci_id = 0x17F011A0
pci_device_id = 0x17F0
```

dulasau commented 2 years ago

I have both the 12GB and 24GB versions, and the problem seems to be consistent across both of them.

republicus commented 2 years ago

I first installed and had it working on my PVE 7.1 node, but had a boot drive failure recently. I swapped in my backup drive, which is currently running PVE 6.4, kernel Linux 5.4.195-1-pve.

I'll work on updating the node back to PVE 7.2+

Host grid driver: 510.47.03

You can DM me on Discord if you wish:

`Republicus#2744`

@angst911 The NVIDIA Advanced Driver Search seems to be less "advanced" than the ordinary search - I'm seeing only old drivers listed (latest 473.81) using it.

Here is a direct link to that driver: NVIDIA RTX / QUADRO DESKTOP AND NOTEBOOK DRIVER RELEASE 510

dulasau commented 2 years ago

It's working!!!!! Although not 100% sure exactly why :-D

I see hours of testing ahead, but here is what I have so far:

  1. It works on my "new" server (two e5-2698v3 and supermicro x10dri-t4+)
  2. It didn't work (one of things I'll test later) on my "old" "server" (Ryzen 5900x + ASRock x570d4u)
  3. Tesla m40 24gb. Going to try 12gb version tonight.
  4. Host OS: Proxmox 7.2-7 (kernel 5.15.39)
  5. Host driver: 510.85.03
  6. Guest OS: Two VMs with Win11
  7. Guest driver: 512.78 (from the post above). Was getting code 43 with 471.41

I was following this setup/config instruction https://gitlab.com/polloloco/vgpu-proxmox and profile config override from here https://drive.google.com/drive/folders/1KHf-vxzUCGqsWZWOW0bXCvMhXh5EJxQl (Jeff from Craft Computing).

dulasau commented 2 years ago

Just in case here is profile override:

```
[profile.nvidia-18]
num_displays = 1
display_width = 1920
display_height = 1080
max_pixels = 2073600
cuda_enabled = 1
frl_enabled = 60
framebuffer = 11811160064
pci_id = 0x17F011A0
pci_device_id = 0x17F0
```


VM config:

```
args: -uuid 00000000-0000-0000-0000-000000000104
balloon: 0
bios: ovmf
boot: order=ide0;ide2;net0
cores: 8
cpu: host
efidisk0: local-lvm:vm-104-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:81:00.0,mdev=nvidia-18,pcie=1
ide0: local-lvm:vm-104-disk-1,size=64G
ide2: NetworkBackup:iso/Win11_English_x64v1.iso,media=cdrom,size=5434622K
machine: pc-q35-7.0
memory: 12288
meta: creation-qemu=7.0.0,ctime=1662489026
name: Win11-3
net0: e1000=16:AB:A7:2D:FB:4B,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-pci
smbios1: uuid=b560b92f-f856-487e-bb00-a2e495665b59
sockets: 1
tpmstate0: local-lvm:vm-104-disk-2,size=4M,version=v2.0
vga: none
vmgenid: 1fa5368d-a7d0-403b-ac65-e033af2de62a
```

republicus commented 2 years ago

That's great! Hope to hear good news about the Tesla M40 12GB.

dulasau commented 2 years ago

The Tesla M40 12GB works as well. I changed the profile override to ~6GB and was able to start two VMs. (Screenshot from 2022-09-06 attached.)
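The arithmetic checks out: two instances at the 5905580032-byte framebuffer from the earlier override sum to exactly 11 GiB, which plausibly fits the 12 GB card once the driver's own reservation is accounted for (the size of that reservation is my assumption, not something stated in the thread):

```shell
#!/bin/sh
# Total bytes consumed by two vGPU instances at the 5905580032-byte
# framebuffer used in the profile override above.
echo $((2 * 5905580032))   # prints 11811160064 (= 11 GiB)
```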

dulasau commented 2 years ago

Alrighty, I tested the Tesla M40 12GB on my Ryzen-based "server" and now it's working! The only changes from my unsuccessful previous attempts are that I have a freshly installed Proxmox (although the same 7.2 version) on it (I was rebuilding my homelab), and probably the guest NVIDIA driver 512.78 (I don't remember which driver version I was using before).