NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

Installing the dcgm-exporter with Helm3 on OpenShift faces permission issues #191

Open vemonet opened 3 years ago

vemonet commented 3 years ago

Hi, we tried to install the dcgm-exporter (a.k.a. gpu-monitoring-tools) on our OKD 4.6.0 cluster.

The GPU node is an NVIDIA DGX V100, set up using NVIDIA/k8s-device-plugin (properly integrated into the OKD 4.6 cluster).

We followed those instructions: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#gpu-telemetry

And, to some extent, these too: https://nvidia.github.io/gpu-monitoring-tools/
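Following those docs, the chart repository was added first (the repo URL below is the one given there; presumably it needs adjusting if it has since moved):

helm repo add gpu-helm-charts https://nvidia.github.io/gpu-monitoring-tools/helm-charts
helm repo update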

We tried to run it without any arguments (as documented in the instructions):

helm install gpu-helm-charts/dcgm-exporter --generate-name

But the pod logs show an error:

unable to set CAP_SETFCAP effective capability: Operation not permitted

So we looked into the chart's values.yaml, which shows that plenty of values can be configured (note that it would be helpful to link this file from the main docs, so that users can quickly find the options they need to make your Helm charts work on their Kubernetes cluster).
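If the chart renders a container securityContext from a values entry (this is an assumption to verify against the chart's values.yaml and templates), the missing capability could in principle be requested through an override such as:

# hypothetical override file -- only works if the chart actually templates
# a securityContext from this value
securityContext:
  capabilities:
    add:
      - SETFCAP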

We also tried to use the anyuid service account, which allows running as root in OpenShift and usually fixes permission errors:

helm install gpu-helm-charts/dcgm-exporter --generate-name --set "serviceAccount.name=anyuid" --set "serviceAccount.create=false"

But we are getting the same permission error again: unable to set CAP_SETFCAP effective capability: Operation not permitted
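We suspect anyuid is actually a SecurityContextConstraint rather than a service account, so pointing serviceAccount.name at it probably does not grant anything by itself. A sketch of what we would expect to need instead (the service account and namespace names below are placeholders for whatever the chart creates):

# grant an SCC that allows the needed capabilities to the exporter's service account
oc adm policy add-scc-to-user privileged -z dcgm-exporter -n dcgm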

We also tried to install it from the YAML file on the master branch:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/dcgm-exporter.yaml

But we got the following error:

Failed to pull image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.3-ubuntu18.04": rpc error: code = Unknown desc = Error reading manifest 2.1.8-2.4.0-rc.3-ubuntu18.04 in nvcr.io/nvidia/k8s/dcgm-exporter: manifest unknown: manifest unknown

This suggests that the image referenced by the deployment on the master branch does not exist or requires specific NVIDIA authorizations, so we cannot try this deployment (should it be removed if it is no longer usable?).
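As a workaround we considered pulling the manifest and pinning the image to a tag that is actually published on NGC (the tag placeholder below is hypothetical; the list of published tags is on the dcgm-exporter page of ngc.nvidia.com):

curl -LO https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/dcgm-exporter.yaml
# replace the rc tag with a published one before applying
sed -i 's|dcgm-exporter:2.1.8-2.4.0-rc.3-ubuntu18.04|dcgm-exporter:<published-tag>|' dcgm-exporter.yaml
kubectl create -f dcgm-exporter.yaml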

On another note, we also tried to define the same settings via a values file, adding the nodeSelector this way:

nodeSelector:
  'nvidia.com/gpu': true

And ran it this way:

helm install gpu-helm-charts/dcgm-exporter --generate-name -f helm-dgcm-exporter.yml

But this gives another error:

Error: DaemonSet in version "v1" cannot be handled as a DaemonSet: v1.DaemonSet.Spec: v1.DaemonSetSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.NodeSelector: ReadString: expects " or n, but found t, error found in #10 byte of ...|com/gpu":true},"serv|..., bigger context ...|dOnly":true}]}],"nodeSelector":{"nvidia.com/gpu":true},"serviceAccountName":"anyuid","volumes":[{"ho|...

This is odd because the provided YAML with nvidia.com/gpu looks legitimate: there is normally no need to escape . or / in keys when they are quoted, and this key is a very common nodeSelector for NVIDIA GPUs. Any idea how this nodeSelector can be set properly?
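(Our current guess is that the error points at the value rather than the key: nodeSelector is a map of string to string, so an unquoted true is parsed as a YAML boolean and rejected by the API server. Quoting the value, as below, might be all that is needed, but confirmation would be welcome.)

nodeSelector:
  'nvidia.com/gpu': 'true'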

Is it possible to deploy the dcgm-exporter on an OpenShift-based Kubernetes cluster?

Which configuration can be used to prevent the error unable to set CAP_SETFCAP effective capability: Operation not permitted? Maybe we need to adjust the ClusterRole to grant more permissions, rather than relying only on what anyuid provides?
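For reference, this is the kind of custom SecurityContextConstraint we imagine might be needed (a rough sketch only; the name, the capability list, the volume types, and the service account reference are all assumptions to be checked against the chart's actual requirements):

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: dcgm-exporter-scc            # hypothetical name
allowPrivilegedContainer: false
allowHostDirVolumePlugin: true       # the exporter mounts hostPath volumes
allowedCapabilities:
  - SETFCAP
  - SYS_ADMIN
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
fsGroup:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
volumes:
  - hostPath
  - configMap
  - secret
  - emptyDir
users:
  - system:serviceaccount:<namespace>:<service-account>   # placeholders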