This driver is still in a very early alpha state. I have only ever attempted to run it in a single-node context on my local DGX A100 box and nowhere else. We plan to improve it to a beta state by December, with proper integration into the GPU operator at that time. Until then it is very much "use at your own risk".
That said, if your issue is related to the plugin not finding NVML, then it is likely due to this: https://github.com/NVIDIA/k8s-dra-driver/blob/main/deployments/helm/k8s-dra-driver/templates/kubeletplugin.yaml#L72
That is currently hard-coded for my Ubuntu distribution on the DGX A100 that I use. It would likely need to be something different on OpenShift.
Thanks @klueska! Appending /run/nvidia/driver/usr/lib64/ did the trick:
- name: LD_LIBRARY_PATH
  value: /usr/lib64/:/run/nvidia/driver/usr/lib/x86_64-linux-gnu:/run/nvidia/driver/usr/lib64/
Now both pods seem happy; the kubelet plugin pod is running with this in the container log:
I0807 14:35:02.967929 1 nonblockinggrpcserver.go:107] "dra: GRPC server started"
I0807 14:35:02.968048 1 nonblockinggrpcserver.go:107] "registrar: GRPC server started"
However, trying to run a pod with a resource claim does nothing: the pod remains waiting, and the resource claim/template is stuck in WaitForFirstConsumer.
I'm running this on a T4 GPU (AWS).
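For context, a minimal claim-template/pod pair under the resource.k8s.io/v1alpha2 API looks roughly like the sketch below; the object names gpu-claim-template and gpu-test are placeholders, not taken from this thread:

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template   # placeholder name
spec:
  spec:
    resourceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test   # placeholder name
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu   # must match a pod-level resourceClaims entry below
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-claim-template

With the default WaitForFirstConsumer allocation mode, the claim is only allocated once the scheduler picks a node for the pod, so a claim stuck in that state usually points at the controller/scheduler side rather than the kubelet plugin.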
@empovit Sorry, this comment may not be related to the issue, but do you know how long it takes on OpenShift to configure a MIG slice on a GPU?
@empovit Given that we have updated our driver detection etc. to also work with GKE's driver installation, could it be that your problem has also been addressed? (Note that you would have to configure the --driver-root when installing the DRA driver using Helm.)
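For example, a values override along these lines (a sketch; nvidiaDriverRoot is my assumption for the chart's key name, so verify it against the chart's values.yaml):

# values-openshift.yaml (sketch): point the DRA driver at the driver
# container's installation root on the host. The key name nvidiaDriverRoot
# is an assumption; check the chart's values.yaml for the actual key.
nvidiaDriverRoot: /run/nvidia/driver

passed to Helm with -f values-openshift.yaml at install/upgrade time.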
I'm trying to run the DRA driver on OpenShift. Here's what I do:

1. Enable the resource.k8s.io/v1alpha2 API group.
2. Build the image using Dockerfile.ubi8.
3. Label the nodes:
   oc label node <node> --overwrite nvidia.com/dra.kubelet-plugin="true"
   oc label node <node> --overwrite nvidia.com/dra.controller="true"
4. Make hostPath volumes work with OpenShift.
5. Create a resource class with driverName: gpu.resource.nvidia.com and a resource claim/template with spec.resourceClassName: gpu.nvidia.com (see the sketch after this list).
6. Create a pod that runs nvidia-smi -L and has a reference to the resource claim.
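The resource class from step 5, as a minimal sketch under resource.k8s.io/v1alpha2 (note that driverName sits at the top level of ResourceClass in this API version, not under spec):

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: gpu.nvidia.com
driverName: gpu.resource.nvidia.com   # must match the name the driver advertises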
Unfortunately, the kubelet plugin pod keeps crashing. The log: