Closed aktech closed 1 year ago
I have contacted Metrostar to schedule a call to reshuffle the GPU cards. It is scheduled for Thursday, January 26, 7:30 – 8:15pm UTC.
We managed to create one more IOMMU group after reshuffling one of the GPUs:
$ lspci -nnk | grep -i NVIDIA
27:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
Kernel modules: nvidiafb, nouveau
28:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
Kernel modules: nvidiafb, nouveau
44:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
Kernel modules: nvidiafb, nouveau
a3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
Kernel modules: nvidiafb, nouveau
c3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
Kernel modules: nvidiafb, nouveau
c4:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1214]
Kernel modules: nvidiafb, nouveau
IOMMU Group 19 27:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 19 28:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 32 44:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 75 a3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 87 c3:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
IOMMU Group 87 c4:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
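For anyone who wants to reproduce a listing like the one above, here is a minimal sketch that walks the IOMMU group tree in sysfs. The function name `list_iommu_groups` and its optional root argument are my own, added for illustration so the function can also be pointed at a test directory. It prints only the group number and the PCI address, with the `0000:` domain prefix that sysfs uses and that plain `lspci` output omits.

```shell
#!/bin/sh
# Print one "IOMMU Group <n> <pci-address>" line per device.
# $1: optional root of the IOMMU group tree (hypothetical parameter,
#     defaults to the standard Linux sysfs location).
list_iommu_groups() (
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # the glob matched nothing
        group="${dev%/devices/*}"      # strip "/devices/<addr>"
        group="${group##*/}"           # keep only the group number
        echo "IOMMU Group $group ${dev##*/}"
    done | sort -k3,3n -k4             # order by group number, then address
)
```

Calling `list_iommu_groups` with no argument reads the live host. To get the device description as well, each address can be passed to `lspci -nns <address>`.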
So now we have 4 separate groups, and up to 4 VMs with GPUs can run in parallel.
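The group count can be derived from the listing itself rather than by eye. This is a rough sketch that assumes input lines in the `IOMMU Group <n> <address> <description>` format shown above; the function names `gpus_per_group` and `count_gpu_groups` are made up for this example.

```shell
#!/bin/sh
# Read "IOMMU Group <n> <pci-address> ..." lines on stdin and print
# "<group> <number of NVIDIA devices>" for each group, sorted by group.
gpus_per_group() {
    awk 'tolower($0) ~ /nvidia/ && $1 == "IOMMU" && $2 == "Group" { n[$3]++ }
         END { for (g in n) print g, n[g] }' | sort -n
}

# Print how many distinct IOMMU groups hold at least one NVIDIA device,
# which is the upper bound on VMs that can each get a GPU at the same time.
count_gpu_groups() {
    gpus_per_group | wc -l | tr -d ' '
}
```

Fed the six `IOMMU Group` lines above, `count_gpu_groups` prints 4, and `gpus_per_group` shows that groups 19 and 87 still hold 2 GPUs each.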
In the GPU server we have 6 GPUs:
Currently the GPUs are paired: each pair of 2 GPUs shares a single IOMMU group.
You can see above that there are 3 IOMMU groups with 2 GPUs each, but unfortunately the 3rd pair (c3:00.0 and c4:00.0) is not detected by the NVIDIA drivers, although both devices can be attached to a VM. This might just be a software issue; I haven't had the time to look. So in a nutshell, we can currently run only 2 concurrent VMs with GPUs.
The PCI devices need to be reshuffled because devices in the same IOMMU group can't be attached to different VMs at the same time.
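A quick way to check this constraint for two specific cards is to compare the `iommu_group` symlink that sysfs exposes for each PCI device. The sketch below is only an illustration: `iommu_group_of` and `can_split_across_vms` are hypothetical helper names, and the optional last argument (an alternative device tree root) exists just to make the functions testable outside a real host.

```shell
#!/bin/sh
# Print the IOMMU group number of a PCI device.
# $1: full PCI address including domain, e.g. 0000:27:00.0
# $2: optional root of the PCI device tree (hypothetical parameter,
#     defaults to the standard Linux sysfs location).
iommu_group_of() (
    link="${2:-/sys/bus/pci/devices}/$1/iommu_group"
    [ -L "$link" ] || { echo "no IOMMU group for $1" >&2; exit 1; }
    target=$(readlink "$link")         # ends in .../iommu_groups/<n>
    echo "${target##*/}"
)

# Exit 0 if the two devices are in different IOMMU groups (so they can be
# attached to different VMs), 1 if they share a group, 2 on lookup failure.
can_split_across_vms() (
    a=$(iommu_group_of "$1" "$3") || exit 2
    b=$(iommu_group_of "$2" "$3") || exit 2
    [ "$a" != "$b" ]
)
```

For example, `can_split_across_vms 0000:27:00.0 0000:44:00.0 && echo ok` would print `ok` only if the two cards sit in different groups.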
Relevant links: