NVIDIA / kubevirt-gpu-device-plugin

NVIDIA k8s device plugin for Kubevirt
BSD 3-Clause "New" or "Revised" License

Passthrough of more than 2 GPUs doesn't work #67

Closed dhruvik7 closed 1 year ago

dhruvik7 commented 1 year ago

When I increase the number of GPUs to passthrough in my VM spec, it maxes out at 2 GPUs actually being accessible (in lspci) inside the VM, although Kubernetes marks the correct number of GPU requests/limits on the node. Is there a limit of 2 GPUs being passed through or am I doing something wrong?

rthallisey commented 1 year ago

There shouldn't be a limit. What does your VMI/VM spec look like? And can you share the allocatable and capacity values for your node?

dhruvik7 commented 1 year ago

Here's the VM spec

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  creationTimestamp: 2018-07-04T15:03:08Z
  generation: 1
  labels:
    kubevirt.io/os: linux
  name: vm-quad-test
spec:
  running: true
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/domain: vm-quad-test
    spec:
      domain:
        cpu:
          cores: 4
        devices:
          disks:
          - disk:
              bus: virtio
            name: disktrio
          - disk:
              bus: virtio
            name: cloudinitdisk
          gpus:
            - name: gpu1
              deviceName: nvidia.com/GA102_GEFORCE_RTX_3090
            - name: gpu2
              deviceName: nvidia.com/GA102_GEFORCE_RTX_3090
            - name: gpu3
              deviceName: nvidia.com/GA102_GEFORCE_RTX_3090
            - name: gpu4
              deviceName: nvidia.com/GA102_GEFORCE_RTX_3090
        machine:
          type: q35
        resources:
          requests:
            memory: 8192M
            nvidia.com/GA102_GEFORCE_RTX_3090: 4
          limits:
            nvidia.com/GA102_GEFORCE_RTX_3090: 4
      volumes:
      - dataVolume:
          name: ubuntu-dv-quad
        name: disktrio
      - name: cloudinitdisk
        cloudInitNoCloud:
          networkData: |
            version: 2
            ethernets:
              enp1s0:
                dhcp4: true
              enp1s1:
                dhcp4: true
          userData: |
            #cloud-config
            password: ubuntu
            chpasswd: { expire: False }
            ssh_pwauth: True
  dataVolumeTemplates:
  - metadata:
      name: ubuntu-dv-quad
    spec:
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 35Gi
      source:
        registry:
          url: "docker://tedezed/ubuntu-container-disk:20.0"

Here's the allocatable and capacity from the node description:

Capacity:
  cpu: 256
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 1134737624Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 247414732Ki
  nvidia.com/GA102_GEFORCE_RTX_3090: 8
  nvidia.com/gpu: 0
  pods: 110
Allocatable:
  cpu: 255900m
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 1045774192547
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 247050188Ki
  nvidia.com/GA102_GEFORCE_RTX_3090: 8
  nvidia.com/gpu: 0
  pods: 110

Running lspci in the resulting VM shows only 2 GPUs passed through:

06:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:87d5]
        Kernel modules: nvidiafb
07:00.0 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:87d5]
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:87d5]
        Kernel modules: nvidiafb
09:00.0 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:87d5]

dhruvik7 commented 1 year ago

Update: looks like it passes through half of the GPUs that are requested. When I adjust the above manifest to request 8 GPUs, 4 GPUs are passed through.
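
One way to read these numbers (a guess, not confirmed plugin behavior): the node's capacity of 8 would be 4 physical GPUs times 2 PCI functions each (the VGA controller plus its onboard audio function), so a request for N "devices" only covers N/2 physical GPUs. A minimal sketch of that arithmetic, with made-up helper names:

```python
# Hypothetical model (not the plugin's real code): each physical GPU exposes
# two PCI functions, the VGA controller (.0) and its audio function (.1).

def advertised_capacity(physical_gpus: int) -> int:
    # If every PCI function is advertised as an allocatable device,
    # capacity is double the number of physical GPUs.
    return 2 * physical_gpus

def gpus_seen_in_vm(requested_devices: int) -> int:
    # If each physical GPU consumes two of the requested "devices"
    # (its VGA and audio functions), the VM sees half the request.
    return requested_devices // 2

# Consistent with the behavior reported in this issue:
assert advertised_capacity(4) == 8   # node reports 8 allocatable
assert gpus_seen_in_vm(4) == 2       # request 4 -> lspci shows 2 GPUs
assert gpus_seen_in_vm(8) == 4       # request 8 -> 4 GPUs
```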

aelbarkani commented 1 year ago

Same here, on NVIDIA RTX A5000. It seems like it counts the GPUs' audio functions as devices. Here are the logs from the device plugin installed through the gpu-operator:

oc logs nvidia-sandbox-device-plugin-daemonset-2ngnl
Defaulted container "nvidia-sandbox-device-plugin-ctr" out of: nvidia-sandbox-device-plugin-ctr, vfio-pci-validation (init), vgpu-devices-validation (init)
2023/09/16 10:55:31 Not a device, continuing
2023/09/16 10:55:31 Nvidia device  0000:31:00.0
2023/09/16 10:55:31 Iommu Group 7
2023/09/16 10:55:31 Device Id 2231
2023/09/16 10:55:31 Nvidia device  0000:31:00.1
2023/09/16 10:55:31 Iommu Group 7
2023/09/16 10:55:31 Nvidia device  0000:4b:00.0
2023/09/16 10:55:31 Iommu Group 5
2023/09/16 10:55:31 Device Id 2231
2023/09/16 10:55:31 Nvidia device  0000:4b:00.1
2023/09/16 10:55:31 Iommu Group 5
2023/09/16 10:55:31 Nvidia device  0000:b1:00.0
2023/09/16 10:55:31 Iommu Group 21
2023/09/16 10:55:31 Device Id 2231
2023/09/16 10:55:31 Nvidia device  0000:b1:00.1
2023/09/16 10:55:31 Iommu Group 21
2023/09/16 10:55:31 Nvidia device  0000:ca:00.0
2023/09/16 10:55:31 Iommu Group 19
2023/09/16 10:55:31 Device Id 2231
2023/09/16 10:55:31 Nvidia device  0000:ca:00.1
2023/09/16 10:55:31 Iommu Group 19
2023/09/16 10:55:31 Error accessing file path "/sys/bus/mdev/devices": lstat /sys/bus/mdev/devices: no such file or directory
2023/09/16 10:55:31 Iommu Map map[19:[{0000:ca:00.0} {0000:ca:00.1}] 21:[{0000:b1:00.0} {0000:b1:00.1}] 5:[{0000:4b:00.0} {0000:4b:00.1}] 7:[{0000:31:00.0} {0000:31:00.1}]]
2023/09/16 10:55:31 Device Map map[2231:[7 5 21 19]]
2023/09/16 10:55:31 vGPU Map  map[]
2023/09/16 10:55:31 GPU vGPU Map  map[]
2023/09/16 10:55:31 DP Name GA102GL_RTX_A5000
2023/09/16 10:55:31 Devicename GA102GL_RTX_A5000
2023/09/16 10:55:31 GA102GL_RTX_A5000 Device plugin server ready
2023/09/16 10:55:31 healthCheck(GA102GL_RTX_A5000): invoked
2023/09/17 15:04:55 In allocate
2023/09/17 15:04:55 Allocated devices map[PCI_RESOURCE_NVIDIA_COM_GA102GL_RTX_A5000:0000:31:00.0,0000:31:00.1,0000:4b:00.0,0000:4b:00.1,0000:b1:00.0,0000:b1:00.1,0000:ca:00.0,0000:ca:00.1]
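
The allocation in that last log line can be broken down to see the function/device split directly. A small sketch (the helper names are ours, not the plugin's) that groups the allocated PCI addresses by slot, showing 8 PCI functions but only 4 physical GPUs:

```python
# Sketch: group the PCI addresses from the plugin's "Allocated devices" log
# line above by slot; functions .0 and .1 of the same slot belong to the
# same physical GPU (VGA controller plus its audio function).
from collections import defaultdict

allocated = ("0000:31:00.0,0000:31:00.1,0000:4b:00.0,0000:4b:00.1,"
             "0000:b1:00.0,0000:b1:00.1,0000:ca:00.0,0000:ca:00.1")

by_slot = defaultdict(list)
for addr in allocated.split(","):
    slot, func = addr.rsplit(".", 1)  # e.g. "0000:31:00" and "0"
    by_slot[slot].append(func)

print(len(allocated.split(",")))  # 8 PCI functions handed to the VM
print(len(by_slot))               # but only 4 physical GPU slots
# Every slot carries exactly a VGA function (.0) and an audio function (.1):
assert all(sorted(funcs) == ["0", "1"] for funcs in by_slot.values())
```
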

aelbarkani commented 1 year ago

@dhruvik7 did you find a solution for this ?

aelbarkani commented 1 year ago

cc @rthallisey

Just to give more context on the issue: I have 4 NVIDIA RTX A5000 GPUs in my servers. This happens in the following scenario:

I tried requesting 8 GPUs so that 4 would actually end up in the VM, but that of course doesn't work, since KubeVirt reports that none of the nodes has 8 GPUs.

dhruvik7 commented 1 year ago

Hey, no, I didn't end up pursuing a solution here; I just used different machines. I think if you can split the VGA and audio functions into separate IOMMU groups it should work.

rthallisey commented 1 year ago

Yes, you'll need to place the GPUs and their audio devices in different IOMMU groups.
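
To see how the host has actually grouped the devices, you can walk the sysfs IOMMU topology. Below is a minimal sketch assuming the standard sysfs layout; note that IOMMU grouping is ultimately decided by the hardware and firmware (e.g. PCIe ACS support), not by KubeVirt or the device plugin. The `root` parameter is only there so the function can be pointed at a test tree:

```python
# Sketch: list IOMMU groups and their member PCI devices from sysfs.
import os

def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict:
    """Return a mapping of IOMMU group number -> sorted PCI addresses."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled, or sysfs not mounted
    for group in sorted(os.listdir(root)):
        devices_dir = os.path.join(root, group, "devices")
        if os.path.isdir(devices_dir):
            groups[group] = sorted(os.listdir(devices_dir))
    return groups

if __name__ == "__main__":
    for group, devices in iommu_groups().items():
        print(f"group {group}: {' '.join(devices)}")
```

If a GPU's VGA function and audio function show up in the same group (as in the plugin log above, e.g. group 7 holding 0000:31:00.0 and 0000:31:00.1), the whole group has to be handed to the VM together.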

aelbarkani commented 12 months ago

Actually, I think kubevirt-gpu-device-plugin is the component that places GPU and audio devices in the same IOMMU group. Is there an option in the plugin or in gpu-operator that allows me to place them in different IOMMU groups, @rthallisey?

aelbarkani commented 12 months ago

I had to completely disable gpu-operator and use hostDevices to make the GPUs work. So if you have a solution for this, that would be great.
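
For anyone landing here, the hostDevices route looks roughly like the following. This is a sketch based on KubeVirt's generic host-device passthrough, not a confirmed reproduction of aelbarkani's setup: the 10DE:2231 vendor:device pair is taken from the plugin log above, and the resource name nvidia.com/GA102GL_RTX_A5000 is an arbitrary label chosen here. First permit the PCI device in the KubeVirt CR, then reference it from the VM spec:

```yaml
# KubeVirt CR fragment: permit the PCI device for passthrough.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: "10DE:2231"   # vendor:device ID from the log above
        resourceName: nvidia.com/GA102GL_RTX_A5000
---
# VM spec fragment: request the permitted device via hostDevices instead of gpus.
spec:
  template:
    spec:
      domain:
        devices:
          hostDevices:
          - deviceName: nvidia.com/GA102GL_RTX_A5000
            name: gpu1
```

This bypasses kubevirt-gpu-device-plugin entirely; the usual host prerequisites (IOMMU enabled, the device bound to vfio-pci) still apply.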