NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

DRA kubelet-plugin pod crashes on OpenShift - complains about NVML and prints usage #4

Closed by empovit 7 months ago

empovit commented 1 year ago

I'm trying to run the DRA driver on OpenShift. Here's what I do:

  1. Enable the DRA feature gate
  2. Enable the resource.k8s.io/v1alpha2 API group
  3. Install the NFD operator and let it label the GPU node
  4. Install the NVIDIA GPU operator to have the GPU driver on the node
  5. Build a custom DRA driver image from Dockerfile.ubi8
  6. Label the node
    • oc label node <node> --overwrite nvidia.com/dra.kubelet-plugin="true"
    • oc label node <node> --overwrite nvidia.com/dra.controller="true"
  7. Make hostPath volumes work with OpenShift:
    • Set the security context to privileged
    • Add the privileged security profile to the service account
  8. Install the Helm chart with the customized driver image
  9. Create a resource class with driverName: gpu.resource.nvidia.com
  10. Create a resource claim template with spec.resourceClassName: gpu.nvidia.com
  11. Create a pod that runs nvidia-smi -L and has (manifest sketches for steps 9-11 follow this list)

    resourceClaims:
      - name: gpu
        source:
          resourceClaimTemplateName: gpu-template
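
For reference, steps 9-11 correspond roughly to manifests like the following. This is a sketch rather than the exact YAML applied: the driverName, resourceClassName, and template name come from the steps above, the resource class name is inferred from the resourceClassName referenced in step 10, and the pod name and container image are placeholders.

    apiVersion: resource.k8s.io/v1alpha2
    kind: ResourceClass
    metadata:
      name: gpu.nvidia.com
    driverName: gpu.resource.nvidia.com
    ---
    apiVersion: resource.k8s.io/v1alpha2
    kind: ResourceClaimTemplate
    metadata:
      name: gpu-template
    spec:
      spec:
        resourceClassName: gpu.nvidia.com
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test                                  # placeholder name
    spec:
      restartPolicy: Never
      containers:
      - name: ctr
        image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # placeholder image
        command: ["nvidia-smi", "-L"]
        resources:
          claims:
          - name: gpu                                 # references the claim declared below
      resourceClaims:
      - name: gpu
        source:
          resourceClaimTemplateName: gpu-template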

Unfortunately, the kubelet plugin pod keeps crashing:

# kubectl get pod -n nvidia-dra-driver
NAME                                                READY   STATUS             RESTARTS        AGE
nvidia-k8s-dra-driver-controller-59d756b8bf-k2bkf   1/1     Running            0               43m
nvidia-k8s-dra-driver-kubelet-plugin-2fn2w          0/1     CrashLoopBackOff   5 (2m23s ago)   6m26s

The log:

Error: error enumerating all possible devices: error initializing NVML: ERROR_LIBRARY_NOT_FOUND
Error: error enumerating all possible devices: error initializing NVML: ERROR_LIBRARY_NOT_FOUND
Usage:
  nvidia-dra-plugin [flags]

Kubernetes client flags:

      --kube-api-burst int     Burst to use while communicating with the kubernetes apiserver. (default 10)
      --kube-api-qps float32   QPS to use while communicating with the kubernetes apiserver. (default 5)
      --kubeconfig string      Absolute path to the kube.config file. Either this or KUBECONFIG need to be set if the driver is being run out of cluster.

CDI flags:

      --cdi-root string   Absolute path to the directory where CDI files will be generated. (default "/etc/cdi")
klueska commented 1 year ago

This driver is still in a very early alpha state. I have only ever attempted to run it in a single-node context on my local DGX A100 box and nowhere else. We plan to improve it to a beta state by December, with proper integration into the GPU operator at that time. Until then it is very much "use at your own risk".

That said, if your issue is related to the plugin not finding NVML, then it is likely due to this: https://github.com/NVIDIA/k8s-dra-driver/blob/main/deployments/helm/k8s-dra-driver/templates/kubeletplugin.yaml#L72

That is currently hard-coded for my Ubuntu distribution on the DGX A100 that I use. It would likely need to be something different on OpenShift.
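
For reference, that stanza sets LD_LIBRARY_PATH on the kubelet plugin container. A paraphrased sketch of what it contained at the time, inferred from the fix shown below and likely different in current chart versions:

    env:
    - name: LD_LIBRARY_PATH
      value: /usr/lib64/:/run/nvidia/driver/usr/lib/x86_64-linux-gnu

On OpenShift with the GPU operator's driver container, the driver's libraries live under /run/nvidia/driver, so the path list has to include the matching lib directory there.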

empovit commented 1 year ago

Thanks @klueska! Appending /run/nvidia/driver/usr/lib64/ did the trick.

- name: LD_LIBRARY_PATH
  value: /usr/lib64/:/run/nvidia/driver/usr/lib/x86_64-linux-gnu:/run/nvidia/driver/usr/lib64/

Now both pods seem happy; the kubelet plugin pod is running, with this in the container log:

I0807 14:35:02.967929       1 nonblockinggrpcserver.go:107] "dra: GRPC server started"
I0807 14:35:02.968048       1 nonblockinggrpcserver.go:107] "registrar: GRPC server started"

However, trying to run a pod with a resource claim does nothing - the pod remains waiting, and the resource claim/template is stuck in WaitForFirstConsumer.

I'm running this on a T4 GPU (AWS).

asm582 commented 10 months ago

@empovit Sorry, this comment may not be related to the issue, but do you know how long it takes on OpenShift to configure a MIG slice on a GPU?

elezar commented 7 months ago

@empovit given that we have updated our driver detection, etc., to also work with GKE's driver installation, could it be that your problem has also been addressed? (Note that you would have to configure the --driver-root when installing the DRA driver using Helm.)
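
For anyone reaching this with a GPU operator-managed driver: the idea is to point the plugin at the driver installation root via a Helm value. A minimal values override sketch, assuming the chart exposes this as nvidiaDriverRoot (check values.yaml for your chart version):

    # values-override.yaml (key name assumed; verify against the chart's values.yaml)
    nvidiaDriverRoot: /run/nvidia/driver

With the GPU operator's driver container, /run/nvidia/driver is the typical driver root, which matches the paths used in the LD_LIBRARY_PATH workaround above.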