Open easyrider14 opened 11 months ago
You can consider adding the gpu-operator-validator as an init container to your daemonset. This way, the daemonset would block on the nvidia-driver-daemonset transitioning to the Ready/Running state.
Sample snippet
initContainers:
  - name: driver-validation
    image: "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0"
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c']
    args: ["nvidia-validator"]
    env:
      - name: WITH_WAIT
        value: "true"
      - name: COMPONENT
        value: driver
    securityContext:
      privileged: true
      seLinuxOptions:
        level: "s0"
    volumeMounts:
      - name: driver-install-path
        mountPath: /run/nvidia/driver
        mountPropagation: HostToContainer
      - name: run-nvidia-validations
        mountPath: /run/nvidia/validations
        mountPropagation: Bidirectional
      - name: host-root
        mountPath: /host
        readOnly: true
        mountPropagation: HostToContainer
      - name: host-dev-char
        mountPath: /host-dev-char
@easyrider14 is your pod requesting a GPU using resource requests/limits (e.g. requesting an nvidia.com/gpu resource)? This is the recommended way of requesting GPUs in Kubernetes and would solve this issue. The pod would not get scheduled on the newly added node until the GPU device plugin is up and running (which only starts after both the NVIDIA driver and the NVIDIA Container Toolkit are installed). If using resource requests/limits is not an option for you, then something along the lines of what @tariq1890 suggested would work.
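For reference, a minimal sketch of such a resource request in the daemonset's container spec (the container name and image are placeholders, not taken from this issue):

containers:
  - name: nvml-agent                        # placeholder name
    image: example.com/nvml-agent:latest    # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1                   # pod stays Pending until the device plugin advertises GPUs on this node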
You can consider adding the gpu-operator-validator as an init container to your daemonset. This way, the daemonset would block on the nvidia-driver-daemonset transitioning to the Ready/Running state
Hi @tariq1890
I've tried this after digging in the gpu-operator manifest files, but I still have the same result. I made a simple test with an initContainer that just waits for 10 minutes before exiting (a simple sleep 10m on an alpine image). When the initContainer exits, my container runs but still fails with no access to NVML. I thought the container would be created after the initContainer finishes, but that does not seem to be the case. The container appears to be created but not started until the initContainer terminates. If I delete the pod, the container is recreated and restarted and immediately has access to NVML.
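If a fixed sleep is not reliable, one alternative sketch is an init container that polls for the validator's status file instead of waiting a fixed time. This assumes the gpu-operator-validator writes a driver-ready marker under /run/nvidia/validations on the host (check the exact file name produced by your gpu-operator version):

initContainers:
  - name: wait-for-driver
    image: alpine:3.19
    command: ['sh', '-c']
    args:
      - |
        # Poll until the validator reports the NVIDIA driver as ready.
        until [ -f /run/nvidia/validations/driver-ready ]; do
          echo "waiting for NVIDIA driver validation..."
          sleep 5
        done
    volumeMounts:
      - name: run-nvidia-validations    # hostPath volume for /run/nvidia/validations, as in the snippet above
        mountPath: /run/nvidia/validations
        readOnly: true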
@cdesiniotis I don't need/want the resources to be reserved for this pod, as it mainly keeps track of the resources available on the node in an etcd database. There is no workload running continuously, just an update of the available RAM/CPU/GPU at regular intervals. I don't want to reserve and block resources for that.
Hi everyone
I'm facing an issue with gpu-operator and the scaling of my K8S cluster. When a GPU node is added to the cluster, gpu-operator will, among other things, install the container runtime and drivers. I've got a daemonset which uses NVML, and it is scheduled on the newly added GPU node as soon as the node is available. But the driver is not ready yet, so initializing NVML fails. The container in my pod exits, but the pod is restarted rather than deleted/recreated, so NVML initialization keeps failing. Which criteria should I use in my daemonset definition to make sure my pod will be able to initialize NVML and run correctly when it is scheduled on the node?
Thanks