DualCoder / vgpu_unlock

Unlock vGPU functionality for consumer grade GPUs.

Problem with type 'C' instances #55

Open Fancyshirts opened 3 years ago

Fancyshirts commented 3 years ago

I'm having some trouble getting type C instances working. This is with a GV100 on CentOS. Type Q instances work without issue. The type C instances are listed as available by mdevctl types, and I can create them with mdevctl without any problems.

The following appears in the logs when I try to start a VM with a C type instance attached:

```
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 9ebb3727-9bfe-492a-8adf-4fe1d1381401 GPU PCI id 00:65:00.0 config params vgpu_type_id=312
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=312
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: notice: vmiop_env_log: Successfully updated env symbols!
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: error: vmiop_log: (0x0): Guest BAR1 is of invalid length (g: 0x200000000, h: 0x10000000)
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (error setting vGPU configuration information from RM)
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: error: vmiop_log: display_init failed for inst: 0
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
Jun 14 09:21:23 hostname nvidia-vgpu-mgr[66999]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x1f
```
DualCoder commented 3 years ago

Interesting. Well, this is the problem:

Guest BAR1 is of invalid length (g: 0x200000000, h: 0x10000000)

This means that the profile expects an 8GB BAR1 (g: 0x200000000 bytes = 8GB), but the host only has a 256MB BAR1 (h: 0x10000000 bytes = 256MB).

According to vgpuConfig.xml, the V100D-8C profile requires an 8GB BAR1, while all the Q profiles require a 256MB BAR1. That would explain why the Q profiles work and the C profiles don't.
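A quick way to confirm the host-side BAR1 size is to read the kernel's view of the BARs from sysfs. Below is a minimal userspace sketch; the PCI address is the one from the log above and should be adjusted to your device. Each line of the sysfs `resource` file is the `start end flags` triple of one PCI resource, so the second line corresponds to BAR1:

```c
/* bar1_size.c - print the size of each BAR as the kernel sees it.
 * Build: gcc -o bar1_size bar1_size.c */
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    /* Adjust the PCI address to match your GPU. */
    const char *path = "/sys/bus/pci/devices/0000:65:00.0/resource";
    FILE *f = fopen(path, "r");
    uint64_t start, end, flags;
    int bar = 0;

    if (!f) {
        perror(path);
        return 1;
    }

    /* One "start end flags" line per resource; unused slots are all zeros. */
    while (fscanf(f, "%" SCNx64 " %" SCNx64 " %" SCNx64,
                  &start, &end, &flags) == 3) {
        if (start || end)
            printf("BAR%d: %" PRIu64 "MB\n", bar, (end - start + 1) >> 20);
        bar++;
    }

    fclose(f);
    return 0;
}
```

`lspci -v` shows the same information in its `Memory at ... [size=...]` lines, as in the logs further down; a host that satisfies the V100D-8C profile should show an 8G (or larger) prefetchable region.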

Now, the thing is that the BAR sizes are initially set up by the BIOS, and 256MB is some kind of legacy limit. So maybe there is a BIOS option to disable the legacy support and get the whole 8GB. Another option could be to resize the BAR from the driver in Linux, but if that were possible I would expect the driver to already do it. So maybe it requires support from the GPU itself, which would be a hard problem to solve.

If I have time I will try to reproduce and investigate the issue on my system later this week.

DualCoder commented 3 years ago

Ok, so I did look into this and I was able to reproduce the issue and see the same error message about the BAR1 length.

I also double-checked the BAR1 bits, and the card does indeed respond with 256MB for BAR1, so my comment about this being a legacy limit was probably incorrect. According to envytools the BAR1 length is decided by straps (resistors soldered onto the PCB), so it might be possible to fix this with a soldering iron. The strap value is provided to the driver, and there is a software override option; I can confirm that my card reports a different BAR1 size after applying the override. Unfortunately I was not able to reassign the BAR to the PCI bridge it sits behind, because that bridge did not have enough address space for the whole 8GB. The solution is to also resize the PCI bridges, but that can cause issues for other PCI devices behind the same bridges if their drivers are already loaded and rely on the PCI addresses remaining constant.
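For reference, the strap registers involved can be inspected from userspace without modifying the kernel module. The BAR0 offsets 0x101000 and 0x10100C below are the ones used in the kernel patch later in this thread; the PCI path and mapping length are illustrative, and even a read-only peek at device registers while the driver owns the GPU is at your own risk:

```c
/* read_straps.c - dump the GPU strap registers from BAR0 (read-only).
 * Build: gcc -o read_straps read_straps.c
 * Run as root; adjust the PCI address to match your GPU. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define STRAPS0_OFFSET 0x00101000u  /* offsets as used in the nv-pci.c patch */
#define STRAPS1_OFFSET 0x0010100Cu

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:65:00.0/resource0";
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return 1;
    }

    /* Map the first 2MB of BAR0, enough to cover both registers. */
    size_t len = 2 << 20;
    volatile uint32_t *bar0 = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (bar0 == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("straps0 = 0x%x\n", bar0[STRAPS0_OFFSET / 4]);
    printf("straps1 = 0x%x\n", bar0[STRAPS1_OFFSET / 4]);

    munmap((void *)bar0, len);
    close(fd);
    return 0;
}
```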

Fancyshirts commented 3 years ago

Interesting - so I guess a real Tesla V100 has the straps set for a much larger BAR1; the product brief (here) lists a 32GB BAR1.

Newer kernels also have a resizable BAR feature; I wonder if we could leverage that somehow? (see this)

DualCoder commented 3 years ago

Resizable BAR is a standardized PCIe feature, which means the PCIe device (the GPU) has to implement it according to the PCIe spec for it to work. The newer Ampere cards do this, and I believe resizing the BAR using this feature could be a solution to the problem. However, older cards do not support it, and there are other issues with Ampere that prevent it from working with vgpu_unlock.
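For cards that do implement the capability, the kernel already has the driver-side plumbing. Below is a rough sketch of that path, assuming a kernel new enough to export `pci_rebar_get_possible_sizes()` alongside `pci_resize_resource()`; this is how in-tree drivers such as amdgpu grow their BARs, not something the NVIDIA driver or the patch below uses:

```c
/*
 * Sketch: ask the PCI core to grow BAR1 to 8GB via the PCIe
 * Resizable BAR capability. This only works if the device actually
 * advertises the capability in config space.
 */
#include <linux/pci.h>

static int try_rebar_8gb(struct pci_dev *pdev)
{
    const int bar = 1;
    /* Bitmask of supported sizes: bit n set means 2^n MB is available. */
    u32 sizes = pci_rebar_get_possible_sizes(pdev, bar);
    int size = 13; /* ReBAR encoding: 2^13 MB = 8GB */

    if (!(sizes & (1u << size)))
        return -EOPNOTSUPP;

    /* The BAR must be released before it can be resized. */
    pci_release_resource(pdev, bar);
    return pci_resize_resource(pdev, bar, size);
}
```

`pci_resize_resource()` also reassigns the bridge windows above the device itself, which is essentially the bookkeeping the strap-based patch below has to do by hand.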

I was able to resize the bridge hierarchy with the following patch to nv-pci.c:

```
--- a/nv-pci.c	2021-01-06 07:48:16.000000000 +0100
+++ b/nv-pci.c	2021-06-19 14:56:30.147809151 +0200
@@ -98,6 +98,174 @@
     rm_init_dynamic_power_management(sp, nv, pr3_acpi_method_present);
 }
 
+static uint64_t vgpu_unlock_get_bar_size(struct pci_dev *dev, int bar)
+{
+    uint32_t bar_lsb;
+    uint32_t bar_msb;
+    NvBool bar_64;
+    uint32_t size_lsb;
+    uint32_t size_msb;
+
+    pci_read_config_dword(dev, PCI_BASE_ADDRESS_0 + sizeof(bar_lsb) * bar, &bar_lsb);
+
+    if ((bar_lsb & 0x6) == 0x4)
+    {
+        /* 64-bit BAR */
+        pci_read_config_dword(dev, PCI_BASE_ADDRESS_1 + sizeof(bar_msb) * bar, &bar_msb);
+        bar_64 = NV_TRUE;
+    }
+    else
+    {
+        /* 32-bit BAR */
+        bar_msb = 0;
+        bar_64 = NV_FALSE;
+    }
+
+    pci_write_config_dword(dev, PCI_BASE_ADDRESS_0 + sizeof(bar_lsb) * bar, 0xFFFFFFFF);
+
+    if (bar_64)
+    {
+        pci_write_config_dword(dev, PCI_BASE_ADDRESS_1 + sizeof(bar_msb) * bar, 0xFFFFFFFF);
+    }
+
+    pci_read_config_dword(dev, PCI_BASE_ADDRESS_0 + sizeof(size_lsb) * bar, &size_lsb);
+
+    if (bar_64)
+    {
+        pci_read_config_dword(dev, PCI_BASE_ADDRESS_1 + sizeof(size_msb) * bar, &size_msb);
+    }
+    else
+    {
+        size_msb = 0xFFFFFFFF;
+    }
+
+    pci_write_config_dword(dev, PCI_BASE_ADDRESS_0 + sizeof(bar_lsb) * bar, bar_lsb);
+
+    if (bar_64)
+    {
+        pci_write_config_dword(dev, PCI_BASE_ADDRESS_1 + sizeof(bar_msb) * bar, bar_msb);
+    }
+
+    return ~(((uint64_t)size_msb << 32) | (size_lsb & 0xFFFFFFF0)) + 1;
+}
+
+static NvBool vgpu_unlock_setup_bar1(struct pci_dev *dev)
+{
+    uint64_t bar1_size;
+    void* bar0;
+    struct pci_bus *bus_iter;
+    int i;
+
+    bar1_size = vgpu_unlock_get_bar_size(dev, 1);
+
+    nv_printf(NV_DBG_SETUP, "NVRM: BAR1 size: %lluMB\n", bar1_size >> 20);
+
+    if (bar1_size >= (8LL << 30))
+    {
+        return NV_TRUE;
+    }
+
+    nv_printf(NV_DBG_SETUP, "NVRM: BAR1 smaller than 8GB, resizing\n");
+
+    if (!pci_is_enabled(dev))
+    {
+        if (pci_enable_device(dev) != 0)
+        {
+            nv_printf(NV_DBG_ERRORS,
+                      "NVRM: pci_enable_device failed, aborting\n");
+            return NV_FALSE;
+        }
+    }
+
+    bar0 = ioremap_nocache(pci_resource_start(dev, 0), pci_resource_len(dev, 0));
+
+    if (!bar0)
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: Failed to map BAR0, aborting\n");
+        return NV_FALSE;
+    }
+
+    {
+        volatile uint32_t *straps0 = (volatile uint32_t*)((uint8_t*)bar0 + 0x00101000);
+        volatile uint32_t *straps1 = (volatile uint32_t*)((uint8_t*)bar0 + 0x0010100C);
+
+        nv_printf(NV_DBG_SETUP, "NVRM: straps0 = 0x%x\n", *straps0);
+        nv_printf(NV_DBG_SETUP, "NVRM: straps1 = 0x%x\n", *straps1);
+
+        *straps1 = 0;                      /* Reset override. */
+        *straps1 |= (1 << 31) | (5 << 20); /* Override BAR1 to 8GB. */
+    }
+
+    iounmap(bar0);
+
+    bar1_size = vgpu_unlock_get_bar_size(dev, 1);
+    nv_printf(NV_DBG_SETUP, "NVRM: New BAR 1 size: %lluMB\n", bar1_size >> 20);
+
+    if (bar1_size < (8LL << 30))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: Failed to resize BAR1, aborting\n");
+        return NV_FALSE;
+    }
+
+    /* Update the kernel's registered size of the BAR1 resource. */
+    pci_resource_end(dev, 1) = pci_resource_start(dev, 1) + bar1_size - 1;
+
+    /* Release everything on the device so it can be reassigned to a larger memory space. */
+    pci_disable_device(dev);
+    pci_release_resource(dev, 0);
+    pci_release_resource(dev, 1);
+    pci_release_resource(dev, 3);
+    pci_release_resource(dev, 5);
+    pci_release_resource(dev, 6);
+
+    bus_iter = dev->bus;
+
+    if (pci_is_root_bus(bus_iter))
+    {
+        /*
+         * In the special case when the device is connected directly to the root
+         * bus we don't need to worry about any bridges.
+         */
+        pci_bus_assign_resources(bus_iter);
+        return NV_TRUE;
+    }
+
+    /*
+     * Iterate up through the PCI hierarchy and release all bridges so they can
+     * be resized.
+     */
+    while (1)
+    {
+        if (!pci_is_bridge(bus_iter->self))
+        {
+            nv_printf(NV_DBG_ERRORS, "NVRM: Device is not a bridge, aborting\n");
+            return NV_FALSE;
+        }
+
+        if (pci_is_enabled(bus_iter->self))
+        {
+            pci_disable_device(bus_iter->self);
+        }
+
+        for (i = PCI_BRIDGE_RESOURCES; i < PCI_BRIDGE_RESOURCE_END; ++i)
+        {
+            pci_release_resource(bus_iter->self, i);
+        }
+
+        if (pci_is_root_bus(bus_iter->parent))
+        {
+            break;
+        }
+
+        bus_iter = bus_iter->parent;
+    }
+
+    /* Resize all bridges and reassign all resources. */
+    pci_bus_size_bridges(bus_iter);
+    pci_bus_assign_resources(bus_iter->parent);
+
+    return NV_TRUE;
+}
 
 /* find nvidia devices and set initial state */
 static int
@@ -115,6 +283,15 @@
     NvBool prev_nv_ats_supported = nv_ats_supported;
     NV_STATUS status;
 
+    /* Resizing of BAR1 to 8GB can be disabled here. */
+#if 1
+    if (!vgpu_unlock_setup_bar1(pci_dev))
+    {
+        goto failed;
+    }
+#endif
+
+
     nv_printf(NV_DBG_SETUP, "NVRM: probing 0x%x 0x%x, class 0x%x\n",
               pci_dev->vendor, pci_dev->device, pci_dev->class);
```

I was not able to get any size above 8GB to work on my system, but this might be a limitation imposed by my hardware.

I also had to add the following to the vgpu_unlock script to get rid of the error message:

```js
// Override the vGPU type's reported BAR1 length (op 0x20801803).
if (op_type == 0x20801803) {
    var bar1_len_ptr = this.argp.add(0x10).readPointer().add(0x1c);
    bar1_len_ptr.writeU16(8192);
}
```
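(The 8192 written here is presumably the BAR1 length in MB: the 0x200000000 bytes the guest asks for in the error message above is exactly 8192 MB.)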

After this the VM boots successfully with the P40-8C profile, and it passes the (very limited) rendering and compute tests that I have performed.

```
$ dmesg
...
[ 7.709739] NVRM: BAR1 size: 256MB
[ 7.709740] NVRM: BAR1 smaller than 8GB, resizing
[ 7.709743] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[ 7.709826] NVRM: straps0 = 0x400080
[ 7.709828] NVRM: straps1 = 0x2013000
[ 7.709857] NVRM: New BAR 1 size: 8192MB
[ 7.709883] nvidia 0000:01:00.0: BAR 0: releasing [mem 0xf6000000-0xf6ffffff]
[ 7.709884] nvidia 0000:01:00.0: BAR 1: releasing [mem 0xe0000000-0x2dfffffff 64bit pref]
[ 7.709884] nvidia 0000:01:00.0: BAR 3: releasing [mem 0xf0000000-0xf1ffffff 64bit pref]
[ 7.709885] nvidia 0000:01:00.0: BAR 5: releasing [io 0xe000-0xe07f]
[ 7.709885] nvidia 0000:01:00.0: BAR 6: releasing [mem 0xf7000000-0xf707ffff pref]
[ 7.709920] pcieport 0000:00:01.0: BAR 13: releasing [io 0xe000-0xefff]
[ 7.709921] pcieport 0000:00:01.0: BAR 14: releasing [mem 0xf6000000-0xf70fffff]
[ 7.709921] pcieport 0000:00:01.0: BAR 15: releasing [mem 0xe0000000-0xf1ffffff 64bit pref]
[ 7.709931] pcieport 0000:00:01.0: BAR 15: assigned [mem 0x800000000-0xaffffffff 64bit pref]
[ 7.709933] pcieport 0000:00:01.0: BAR 14: assigned [mem 0xe0000000-0xe17fffff]
[ 7.709935] pcieport 0000:00:01.0: BAR 13: assigned [io 0x2000-0x2fff]
[ 7.709937] nvidia 0000:01:00.0: BAR 1: assigned [mem 0x800000000-0x9ffffffff 64bit pref]
[ 7.709941] nvidia 0000:01:00.0: BAR 3: assigned [mem 0xa00000000-0xa01ffffff 64bit pref]
[ 7.709944] nvidia 0000:01:00.0: BAR 0: assigned [mem 0xe0000000-0xe0ffffff]
[ 7.709946] nvidia 0000:01:00.0: BAR 6: assigned [mem 0xe1000000-0xe107ffff pref]
[ 7.709947] nvidia 0000:01:00.0: BAR 5: assigned [io 0x2000-0x207f]
[ 7.709949] pcieport 0000:00:01.0: PCI bridge to [bus 01]
[ 7.709949] pcieport 0000:00:01.0:   bridge window [io 0x2000-0x2fff]
[ 7.709951] pcieport 0000:00:01.0:   bridge window [mem 0xe0000000-0xe17fffff]
[ 7.709952] pcieport 0000:00:01.0:   bridge window [mem 0x800000000-0xaffffffff 64bit pref]
[ 7.709956] pci 0000:03:00.0: PCI bridge to [bus 04]
[ 7.709971] pci 0000:00:1c.3: PCI bridge to [bus 03-04]
[ 7.709978] NVRM: probing 0x10de 0x1b00, class 0x30000
[ 7.710036] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 7.710070] NVRM: PCI:0000:01:00.0 (10de:1b00): BAR0 @ 0xe0000000 (16MB)
[ 7.710071] NVRM: PCI:0000:01:00.0 (10de:1b00): BAR1 @ 0x800000000 (8192MB)
...

$ lspci -v
...
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN X] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation GP102 [TITAN X]
        Flags: bus master, fast devsel, latency 0, IRQ 33
        Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 800000000 (64-bit, prefetchable) [size=8G]
        Memory at a00000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 2000 [size=128]
        [virtual] Expansion ROM at e1000000 [disabled] [size=512K]
...
```
questnode commented 1 year ago

Hi. I'm seeing the same issue on my Proxmox setup. I apologize for the trouble, but this post is the only hit I get when searching for a solution. I've read through your patch, but I don't know how to apply it, and I'm hoping I can be pointed in the right direction. Thanks so much.

My main system is a Haswell i7-4790K on a Z97 motherboard, and my graphics card is a Tesla P4. The driver version is 510.108.03, and the profile I'm using is GRID P40-12C (nvidia-286). I've also made the following profile override:

```toml
[profile.nvidia-286]
num_displays = 1
display_width = 3840
display_height = 2160
max_pixels = 8294400
cuda_enabled = 1
frl_enabled = 60
framebuffer = 3937053354
pci_id = 0x1B3011A0
pci_device_id = 0x1B30
```

This is the error I see when I try to start the VM:

```
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_log: (0x0): Guest BAR1 is of invalid length (g: 0x400000000, h: 0x10000000)
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (error setting vGPU configuration information from RM)
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_log: display_init failed for inst: 0
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
Jan 16 09:35:20 pve nvidia-vgpu-mgr[2505]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x1f
```