NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Waiting for gpu node to be ready before scheduling pods using NVML #615

Open · easyrider14 opened 11 months ago

easyrider14 commented 11 months ago

Hi everyone

I'm facing an issue with the gpu-operator and scaling of my K8s cluster. When a GPU node is added to the cluster, the gpu-operator will, among other things, install the container runtime and drivers. I have a daemonset which uses NVML, and it gets scheduled on the newly added GPU node as soon as the node is available. But at that point the driver is not ready yet, so initializing NVML fails. The container in my pod exits, but the pod is restarted rather than deleted/recreated, so NVML initialization keeps failing. Which criteria should I use in my daemonset definition to make sure my pod will be able to initialize NVML and run correctly once it is scheduled on the node?

Thanks

tariq1890 commented 11 months ago

You can consider adding the gpu-operator-validator as an init container to your daemonset. This way, the daemonset would block on the nvidia-driver-daemonset transitioning to the Ready/Running state.

Sample snippet

      initContainers:
      - name: driver-validation
        image: "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0"
        imagePullPolicy: IfNotPresent
        command: ['sh', '-c']
        args: ["nvidia-validator"]
        env:
          # WITH_WAIT=true makes the validator block until the driver validation succeeds
          - name: WITH_WAIT
            value: "true"
          # only validate the driver component
          - name: COMPONENT
            value: driver
        securityContext:
          privileged: true
          seLinuxOptions:
            level: "s0"
        # the volumes referenced below must be defined in the pod spec; see the sketch after this snippet
        volumeMounts:
          - name: driver-install-path
            mountPath: /run/nvidia/driver
            mountPropagation: HostToContainer
          - name: run-nvidia-validations
            mountPath: /run/nvidia/validations
            mountPropagation: Bidirectional
          - name: host-root
            mountPath: /host
            readOnly: true
            mountPropagation: HostToContainer
          - name: host-dev-char
            mountPath: /host-dev-char
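
Note that the volumeMounts above reference volumes defined elsewhere in the validator daemonset spec. A minimal sketch of matching hostPath volume definitions, assuming the operator's default /run/nvidia paths, would look something like this (adjust if you customised the driver root):

      volumes:
      - name: driver-install-path
        hostPath:
          path: /run/nvidia/driver      # assumes the default driver install root used by the operator
      - name: run-nvidia-validations
        hostPath:
          path: /run/nvidia/validations
          type: DirectoryOrCreate
      - name: host-root
        hostPath:
          path: /
      - name: host-dev-char
        hostPath:
          path: /dev/char
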
cdesiniotis commented 11 months ago

@easyrider14 is your pod requesting a GPU using resource requests/limits (e.g. requesting an nvidia.com/gpu resource)? This is the recommended way for requesting GPUs in Kubernetes and would solve this issue. The pod would not get scheduled on the newly added node until the GPU device-plugin is up and running (which only starts after both the NVIDIA driver and NVIDIA Container Toolkit are installed). If using resource requests/limits is not an option for you, then something along the lines of what @tariq1890 suggested would work.
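
For reference, a minimal sketch of what such a resource request looks like in the container spec of a daemonset (the container name and image are placeholders):

      containers:
      - name: nvml-monitor                        # placeholder name
        image: example.com/nvml-monitor:latest    # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1   # pod won't be scheduled until the device plugin advertises GPUs on the node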

easyrider14 commented 11 months ago

> You can consider adding the gpu-operator-validator as an init container to your daemonset. This way, the daemonset would block on the nvidia-driver-daemonset transitioning to the Ready/Running state.

Hi @tariq1890

I've tried this after digging in the gpu-operator manifest files, but I still get the same result. I also made a simple test with an initContainer that just waits for 10 minutes before exiting (a simple sleep 10m on an alpine image). When the initContainer exits, my container runs but still fails with no access to NVML. I thought the container would be created after the initContainer finishes, but that does not seem to be the case: the container appears to be created up front and only started once the initContainer terminates. If I delete the pod, the container is recreated and restarted and then has direct access to NVML.

@cdesiniotis I don't need/want the resources to be reserved for this pod, as it mainly keeps a record of available resources on the node in an etcd database. It is not a workload running continuously, just a periodic update of the available RAM/CPU/GPU. I don't want to reserve and block resources for that.