NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes

Rework MPS limit normalization #11

Closed elezar closed 6 months ago

elezar commented 10 months ago

With this change, we always specify limits in terms of GPU UUIDs when passing them to the MPS control daemon. We also check that any device indices are valid.

With this change, the MPS control daemon is started as follows:

spec:
  containers:
  - args:
    - |-
      set -e
      rm -f /var/log/nvidia-mps/startup.log

      nvidia-cuda-mps-control -d
      echo set_default_active_thread_percentage 50 | nvidia-cuda-mps-control
      echo set_default_device_pinned_mem_limit GPU-f22fb098-d1b3-3806-2655-ba25f02229c1 10240M | nvidia-cuda-mps-control

      echo "startup complete" > /var/log/nvidia-mps/startup.log

      tail -n +1 -f /var/log/nvidia-mps/control.log
    command:
    - chroot
    - /driver-root
    - sh
    - -c
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: GPU-f22fb098-d1b3-3806-2655-ba25f02229c1

Assuming the following claim parameters:

---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: sharing-demo
  name: gpu-mps-sharing
spec:
  sharing:
    strategy: MPS
    mpsConfig:
      defaultActiveThreadPercentage: 50
      defaultPinnedDeviceMemoryLimit: 10Gi

and

spec:
  containers:
  - args:
    - |-
      set -e
      rm -f /var/log/nvidia-mps/startup.log

      nvidia-cuda-mps-control -d
      echo set_default_active_thread_percentage 50 | nvidia-cuda-mps-control
      echo set_default_device_pinned_mem_limit GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e 5120M | nvidia-cuda-mps-control

      echo "startup complete" > /var/log/nvidia-mps/startup.log

      tail -n +1 -f /var/log/nvidia-mps/control.log
    command:
    - chroot
    - /driver-root
    - sh
    - -c
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e

when using:

---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: sharing-demo
  name: gpu-mps-sharing
spec:
  sharing:
    strategy: MPS
    mpsConfig:
      defaultActiveThreadPercentage: 50
      defaultPinnedDeviceMemoryLimit: 10Gi
      defaultPerDevicePinnedMemoryLimit:
        0: 5Gi
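
The normalization described above can be sketched roughly as follows. This is a minimal illustration and not the driver's actual code: `normalizeLimits` and `controlCommands` are hypothetical helpers, limits are assumed to be already converted to MiB (10Gi becomes 10240), and per-device keys are assumed to be either a device index or a GPU UUID. The default limit is applied to every allocated GPU first, then per-device overrides replace it, with index keys validated and resolved to UUIDs.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// normalizeLimits is a hypothetical helper, not the driver's implementation.
// uuids lists the allocated GPUs in index order, defaultMiB is the default
// pinned memory limit in MiB (0 means unset), and perDevice maps either a
// device index ("0") or a GPU UUID to an override in MiB.
// The result is always keyed by UUID.
func normalizeLimits(uuids []string, defaultMiB int64, perDevice map[string]int64) (map[string]string, error) {
	known := make(map[string]bool, len(uuids))
	limits := make(map[string]string, len(uuids))
	for _, u := range uuids {
		known[u] = true
		if defaultMiB > 0 {
			limits[u] = fmt.Sprintf("%dM", defaultMiB)
		}
	}

	overridden := make(map[string]bool, len(perDevice))
	for key, mib := range perDevice {
		uuid := key
		if !known[key] {
			// Not a known UUID, so the key must be a valid device index.
			i, err := strconv.Atoi(key)
			if err != nil {
				return nil, fmt.Errorf("unknown device %q", key)
			}
			if i < 0 || i >= len(uuids) {
				return nil, fmt.Errorf("device index %d out of range [0, %d)", i, len(uuids))
			}
			uuid = uuids[i]
		}
		// Reject an index and a UUID that both name the same device.
		if overridden[uuid] {
			return nil, fmt.Errorf("device %s has more than one override", uuid)
		}
		overridden[uuid] = true
		limits[uuid] = fmt.Sprintf("%dM", mib)
	}
	return limits, nil
}

// controlCommands renders one control-daemon command per UUID, sorted so
// that the generated script is stable across runs.
func controlCommands(limits map[string]string) []string {
	uuids := make([]string, 0, len(limits))
	for u := range limits {
		uuids = append(uuids, u)
	}
	sort.Strings(uuids)

	cmds := make([]string, 0, len(uuids))
	for _, u := range uuids {
		cmds = append(cmds, fmt.Sprintf(
			"echo set_default_device_pinned_mem_limit %s %s | nvidia-cuda-mps-control",
			u, limits[u]))
	}
	return cmds
}

func main() {
	// Mirrors the second example: default 10Gi, index 0 overridden to 5Gi.
	uuids := []string{"GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e"}
	limits, err := normalizeLimits(uuids, 10240, map[string]int64{"0": 5120})
	if err != nil {
		panic(err)
	}
	for _, c := range controlCommands(limits) {
		fmt.Println(c)
	}
}
```

Run against the second set of claim parameters, this sketch prints the same `set_default_device_pinned_mem_limit ... 5120M` line that appears in the generated spec above. An index that does not correspond to an allocated GPU, such as `1` for a single-GPU claim, returns an error instead of being passed through to the daemon.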