NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

DRA kubelet-plugin pod crashes on OpenShift - complains about NVML and prints usage #4

Closed by empovit 7 months ago

empovit commented 1 year ago

I'm trying to run the DRA driver on OpenShift. Here's what I do:

  1. Enable the DRA feature gate
  2. Enable the resource.k8s.io/v1alpha2 API group
  3. Install the NFD operator and let it label the GPU node
  4. Install the NVIDIA GPU operator to have the GPU driver on the node
  5. Build a custom DRA driver image from Dockerfile.ubi8
  6. Label the node
    • oc label node <node> --overwrite nvidia.com/dra.kubelet-plugin="true"
    • oc label node <node> --overwrite nvidia.com/dra.controller="true"
  7. Make hostPath volumes work with OpenShift:
    • Set the security context to privileged
    • Add the privileged security profile to the service account
  8. Install the Helm chart with the customized driver image
  9. Create a resource class with driverName: gpu.resource.nvidia.com
  10. Create a resource claim template with spec.resourceClassName: gpu.nvidia.com
  11. Create a pod that runs nvidia-smi -L and has (manifest sketches for steps 9-11 follow this list)

    resourceClaims:
      - name: gpu
        source:
          resourceClaimTemplateName: gpu-template
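
For reference, steps 9-11 correspond roughly to manifests like the following. This is a sketch rather than the exact YAML applied: the driverName, resourceClassName, and template name come from the steps above, the resource class name is inferred from the resourceClassName referenced in step 10, and the pod name and container image are placeholders.

    apiVersion: resource.k8s.io/v1alpha2
    kind: ResourceClass
    metadata:
      name: gpu.nvidia.com
    driverName: gpu.resource.nvidia.com
    ---
    apiVersion: resource.k8s.io/v1alpha2
    kind: ResourceClaimTemplate
    metadata:
      name: gpu-template
    spec:
      spec:
        resourceClassName: gpu.nvidia.com
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test                                  # placeholder name
    spec:
      restartPolicy: Never
      containers:
      - name: ctr
        image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # placeholder image
        command: ["nvidia-smi", "-L"]
        resources:
          claims:
          - name: gpu                                 # references the claim declared below
      resourceClaims:
      - name: gpu
        source:
          resourceClaimTemplateName: gpu-template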

Unfortunately, the kubelet plugin pod keeps crashing:

# kubectl get pod -n nvidia-dra-driver
NAME                                                READY   STATUS             RESTARTS        AGE
nvidia-k8s-dra-driver-controller-59d756b8bf-k2bkf   1/1     Running            0               43m
nvidia-k8s-dra-driver-kubelet-plugin-2fn2w          0/1     CrashLoopBackOff   5 (2m23s ago)   6m26s

The log:

Error: error enumerating all possible devices: error initializing NVML: ERROR_LIBRARY_NOT_FOUND
Error: error enumerating all possible devices: error initializing NVML: ERROR_LIBRARY_NOT_FOUND
Usage:
  nvidia-dra-plugin [flags]

Kubernetes client flags:

      --kube-api-burst int     Burst to use while communicating with the kubernetes apiserver. (default 10)
      --kube-api-qps float32   QPS to use while communicating with the kubernetes apiserver. (default 5)
      --kubeconfig string      Absolute path to the kube.config file. Either this or KUBECONFIG need to be set if the driver is being run out of cluster.

CDI flags:

      --cdi-root string   Absolute path to the directory where CDI files will be generated. (default "/etc/cdi")
klueska commented 1 year ago

This driver is still in a very early alpha state. I have only ever attempted to run it in a single-node context on my local DGX A100 box and nowhere else. We plan to improve it to a beta state by December, with proper integration into the GPU operator at that time. Until then it is very much "use at your own risk".

That said, if your issue is related to the plugin not finding NVML, then it is likely due to this: https://github.com/NVIDIA/k8s-dra-driver/blob/main/deployments/helm/k8s-dra-driver/templates/kubeletplugin.yaml#L72

That is currently hard-coded for my Ubuntu distribution on the DGX A100 that I use. It would likely need to be something different on OpenShift.
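
For reference, that stanza sets LD_LIBRARY_PATH on the kubelet plugin container. A paraphrased sketch of what it contained at the time, inferred from the fix shown below and likely different in current chart versions:

    env:
    - name: LD_LIBRARY_PATH
      value: /usr/lib64/:/run/nvidia/driver/usr/lib/x86_64-linux-gnu

On OpenShift with the GPU operator's driver container, the driver's libraries live under /run/nvidia/driver, so the path list has to include the matching lib directory there.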

empovit commented 1 year ago

Thanks @klueska! Appending /run/nvidia/driver/usr/lib64/ did the trick.

- name: LD_LIBRARY_PATH
  value: /usr/lib64/:/run/nvidia/driver/usr/lib/x86_64-linux-gnu:/run/nvidia/driver/usr/lib64/

Now both pods seem happy; the kubelet plugin pod is running, with this in the container log:

I0807 14:35:02.967929       1 nonblockinggrpcserver.go:107] "dra: GRPC server started"
I0807 14:35:02.968048       1 nonblockinggrpcserver.go:107] "registrar: GRPC server started"

However, trying to run a pod with a resource claim does nothing - the pod remains waiting, and the resource claim/template is stuck in WaitForFirstConsumer.

I'm running this on a T4 GPU (AWS).

asm582 commented 10 months ago

@empovit Sorry, this comment may not be related to the issue, but do you know how long it takes on OpenShift to configure a MIG slice on a GPU?

elezar commented 7 months ago

@empovit given that we have updated our driver detection, etc., to also work with GKE's driver installation, could it be that your problem has also been addressed? (Note that you would have to configure the --driver-root when installing the DRA driver using Helm.)
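
For anyone reaching this with a GPU operator-managed driver: the idea is to point the plugin at the driver installation root via a Helm value. A minimal values override sketch, assuming the chart exposes this as nvidiaDriverRoot (check values.yaml for your chart version):

    # values-override.yaml (key name assumed; verify against the chart's values.yaml)
    nvidiaDriverRoot: /run/nvidia/driver

With the GPU operator's driver container, /run/nvidia/driver is the typical driver root, which matches the paths used in the LD_LIBRARY_PATH workaround above.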