Open Fancyshirts opened 3 years ago
Interesting. Well this is the problem:
Guest BAR1 is of invalid length (g: 0x200000000, h: 0x10000000)
This means that the profile expects an 8GB BAR1, but the host only has a 256MB BAR1.
According to vgpuConfig.xml, the V100D-8C profile requires an 8GB BAR1 and all Q profiles require a 256MB BAR1. So this would explain why the Q profiles work and the C profiles don't.
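For anyone who wants to check the numbers in that error message, here is a minimal Python sketch that pulls the guest (g) and host (h) values out of the log line and prints them as sizes. The function names are hypothetical helpers, not part of any NVIDIA tool, and the sketch assumes the values are byte counts.

```python
import re

def parse_bar1_error(line):
    """Return (guest_bytes, host_bytes) from a 'Guest BAR1 is of invalid length' line."""
    m = re.search(r"g:\s*(0x[0-9a-fA-F]+),\s*h:\s*(0x[0-9a-fA-F]+)", line)
    if m is None:
        raise ValueError("no BAR1 sizes found in line")
    return int(m.group(1), 16), int(m.group(2), 16)

def human(nbytes):
    """Format a byte count as whole GiB when possible, otherwise as MiB."""
    mib = nbytes // (1024 * 1024)
    if mib >= 1024 and mib % 1024 == 0:
        return f"{mib // 1024} GiB"
    return f"{mib} MiB"

guest, host = parse_bar1_error(
    "Guest BAR1 is of invalid length (g: 0x200000000, h: 0x10000000)"
)
print(human(guest), "expected by the profile,", human(host), "offered by the host")
```

So 0x200000000 is the 8GB that the profile asks for and 0x10000000 is the 256MB that the card exposes.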
Now the thing is that the BAR sizes are initially set up by the BIOS, and 256MB is some kind of legacy limit. So maybe there is some option in the BIOS to disable the legacy support and get the whole 8GB. Another option could be to resize the BAR from the driver in Linux, but if that were possible I would expect the driver to already do it. So maybe it requires support by the GPU itself, which would be a hard problem to solve.
If I have time I will try to reproduce and investigate the issue on my system later this week.
Ok, so I did look into this and I was able to reproduce the issue and see the same error message about the BAR1 length.
I also double checked the BAR1 bits and the card does indeed respond with 256MB for BAR1, so my comment about this being a legacy limit was probably incorrect. According to envytools the BAR1 length is decided by straps (resistors soldered onto the PCB), so it might be possible to fix this issue with a soldering iron. The strap value is provided to the driver and there is a software override option. And I can confirm that my card does report a different BAR1 size after applying a software override. Unfortunately I was not able to reassign the BAR to the PCI bridge that it sits behind, because that bridge does not have enough space for the whole 8GB. The solution to this would be to also resize the PCI bridges, but that would likely cause issues for other PCI devices behind the same bridges if their drivers are already loaded and rely on the PCI addresses remaining constant.
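If you want to see which BAR sizes the kernel has currently assigned, one option is to read the device's `resource` file in sysfs, where each line holds a start address, an end address and flags in hex. The following is a minimal Python sketch of that calculation. The function name, the sample addresses and the PCI address in the comment are made up for illustration, and which line corresponds to BAR1 on a given card is an assumption you should check against `lspci -v`.

```python
def bar_sizes(resource_text):
    """Return the size in bytes of each region listed in a sysfs 'resource' file.

    Each line is 'start end flags' in hex; an unassigned region is all zeros.
    """
    sizes = []
    for line in resource_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        start, end, _flags = (int(f, 16) for f in fields)
        sizes.append(0 if start == 0 and end == 0 else end - start + 1)
    return sizes

# Hypothetical sample contents, not taken from a real card.
SAMPLE = """\
0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200
0x00000000e0000000 0x00000000efffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

for index, size in enumerate(bar_sizes(SAMPLE)):
    print(f"region {index}: {size // (1024 * 1024)} MiB")

# On a real host you would read the file instead, for example (address is an assumption):
#   text = open("/sys/bus/pci/devices/0000:01:00.0/resource").read()
```

In the sample, the second region comes out as 256 MiB, which is the same host value (0x10000000) that the error message reports.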
Resizable BAR is a standardized PCIe feature, which means the PCIe device (GPU) has to implement it according to the PCIe spec for it to work. The newer Ampere cards do this, and I believe resizing the BAR using this feature could be a solution to the problem. However, older cards do not support it, and there are other issues with Ampere that prevent it from working with vgpu_unlock.
I was able to resize the bridge hierarchy with the following patch to nv-pci.c:
I was not able to get any size above 8GB to work on my system. But this might be a limitation imposed by my hardware.
I also had to add the following to the vgpu_unlock script to get rid of the error message:
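// Patch the BAR1 length in the reply to op 0x20801803 so it matches the resized BAR.
if (op_type == 0x20801803) {
    // The length field is at offset 0x1c of the struct pointed to at argp + 0x10.
    var bar1_len_ptr = this.argp.add(0x10).readPointer().add(0x1c);
    bar1_len_ptr.writeU16(8192); // presumably in MB, i.e. 8GB
}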
After this the VM boots successfully with the P40-8C profile. And it passes some very limited rendering and compute tests that I have performed.
Hi. I'm seeing the same issue on my Proxmox setup. I apologize for the trouble, but this post is the only hit returned when I search for a solution. I've read through your patch, but I don't know how to apply it. I'm hoping that I could be pointed in the right direction. Thanks so much.
My main system is a Haswell i7-4790K on a Z97 motherboard. My graphics card is a Tesla P4. The driver version is 510.108.03. The profile that I'm using is GRID P40-12C (nvidia-286). I've also made the following profile override:
[profile.nvidia-286]
num_displays = 1
display_width = 3840
display_height = 2160
max_pixels = 8294400
cuda_enabled = 1
frl_enabled = 60
framebuffer = 3937053354
pci_id = 0x1B3011A0
pci_device_id = 0x1B30
This is the error that I'm seeing when I try to start the VM.
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_log: (0x0): Guest BAR1 is of invalid length (g: 0x400000000, h: 0x10000000)
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (error setting vGPU configuration information from RM)
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_log: display_init failed for inst: 0
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x1f
I'm having some trouble getting type C instances working. This is with a GV100, in CentOS. Type Q instances work without issue. These type C instances are listed as available using mdevctl types, and I can create them using mdevctl without any problems. The following appears in the logs when I try to start a VM with a C type instance attached: