Quansight / open-gpu-server

The Open GPU Server for CI purpose.

Support running up to 6 concurrent VMs with GPU #7

Closed aktech closed 1 year ago

aktech commented 1 year ago

In the GPU server we have 6 GPUs:

27:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
28:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
43:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
44:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
c3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
c4:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau

Currently, each pair of GPUs shares an IOMMU group:

IOMMU Group 19 27:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 19 28:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 32 43:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 32 44:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 87 c3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 87 c4:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)

As shown above, there are 3 IOMMU groups with 2 GPUs each. Unfortunately, the GPUs in the 3rd group (c3:00.0 and c4:00.0) are not detected by the NVIDIA drivers, although they can be attached to a VM. This might just be a software issue; I haven't had time to look into it.
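For anyone reproducing the group listing, here is a minimal sketch of one way to collect it. It assumes a Linux host that exposes IOMMU groups under `/sys/kernel/iommu_groups`; the function name `list_iommu_groups` is made up for illustration and is not part of this repository.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains.

    `root` defaults to the usual sysfs location (an assumption about the host);
    it is a parameter so the function can be pointed at a test directory.
    """
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # IOMMU disabled, or the path does not exist on this machine.
        return groups
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        # Each entry under devices/ is named after a PCI address.
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups
```

Calling `list_iommu_groups()` on the host returns a dictionary keyed by group number. Note that sysfs names devices with the PCI domain prefix (for example `0000:27:00.0`), whereas the listing above shows the short form.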

So, in a nutshell, we can currently run only 2 concurrent VMs with GPUs.

The PCI devices need to be reshuffled because devices in the same IOMMU group can't be attached to different VMs at the same time.
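To make the constraint concrete, here is a rough sketch that parses an IOMMU group listing like the one above and counts the distinct groups, which is the upper bound on VMs that can each get a GPU. The helper names (`group_devices`, `max_concurrent_vms`) are hypothetical, and the listing is abbreviated to the group number and PCI address from this issue.

```python
import re
from collections import defaultdict

# Matches lines such as "IOMMU Group 19 27:00.0 3D controller [0302]: ..."
GROUP_LINE = re.compile(r"IOMMU Group (\d+) (\S+)")


def group_devices(listing):
    """Return {group number: [PCI addresses]} parsed from the listing text."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        match = GROUP_LINE.match(line.strip())
        if match:
            groups[int(match.group(1))].append(match.group(2))
    return dict(groups)


def max_concurrent_vms(listing):
    """Upper bound on GPU VMs: one VM per IOMMU group.

    All devices in a group have to go to the same VM, so the number of
    groups, not the number of GPUs, limits concurrency.
    """
    return len(group_devices(listing))


# Abbreviated copy of the listing in this issue (before any reshuffling).
BEFORE = """\
IOMMU Group 19 27:00.0 3D controller
IOMMU Group 19 28:00.0 3D controller
IOMMU Group 32 43:00.0 3D controller
IOMMU Group 32 44:00.0 3D controller
IOMMU Group 87 c3:00.0 3D controller
IOMMU Group 87 c4:00.0 3D controller
"""

print(group_devices(BEFORE))
print(max_concurrent_vms(BEFORE))  # 3 groups
```

This gives 3 as the theoretical limit for the layout above. In practice only 2 VMs are usable right now, because the GPUs in group 87 are not detected by the drivers.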

Relevant links:

aktech commented 1 year ago

I have contacted Metrostar to schedule a call to reshuffle the GPU cards. It is scheduled for Thursday, January 26, 7:30 – 8:15pm UTC.

aktech commented 1 year ago

Outcome

We managed to create one more IOMMU group after reshuffling one of the GPUs:

$ lspci -nnk | grep -i NVIDIA
27:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
28:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
44:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
a3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
c3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau
c4:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
    Kernel modules: nvidiafb, nouveau

IOMMU Group 19 27:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 19 28:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 32 44:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 75 a3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 87 c3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 87 c4:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)

So now we have 4 separate IOMMU groups, and up to 4 parallel VMs with GPUs can be created.