Hi, we tried to install the dcgm-exporter (a.k.a. gpu-monitoring-tools) on our OKD 4.6.0 cluster.
The GPU node is an Nvidia DGX V100, set up with NVIDIA/k8s-device-plugin (properly integrated into the OKD 4.6 cluster).
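In case it helps, the GPU resource advertised on that node can be checked with something like this (the node name is a placeholder):
oc describe node <gpu-node> | grep nvidia.com/gpu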
We followed these instructions: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#gpu-telemetry
And partly these too: https://nvidia.github.io/gpu-monitoring-tools/
We tried to run it without any arguments (as documented in the instructions):
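Something like the following, assuming the gpu-helm-charts repository from the docs has already been added with helm repo add:
helm install --generate-name gpu-helm-charts/dcgm-exporter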
But it is showing an error in the pod logs:
unable to set CAP_SETFCAP effective capability: Operation not permitted
So we looked into the Values.yml, which shows that there are plenty of values that can be configured (note that it would be helpful for users to have a link to this file in the main docs, so that they quickly know where to find more information to make your Helm charts work on their Kubernetes cluster).
We also tried to use the anyuid service account, which allows running as root in OpenShift and normally fixes permission errors, but we are getting the same permission error again:
unable to set CAP_SETFCAP effective capability: Operation not permitted
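For context, granting an SCC to a service account on OpenShift is done with something along these lines (the service account name matches what we set in the DaemonSet, the namespace is a placeholder):
oc create serviceaccount anyuid -n <namespace>
oc adm policy add-scc-to-user anyuid -z anyuid -n <namespace>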
We also tried to install it from the YAML file on the master branch, but we get the following error:
Failed to pull image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.3-ubuntu18.04": rpc error: code = Unknown desc = Error reading manifest 2.1.8-2.4.0-rc.3-ubuntu18.04 in nvcr.io/nvidia/k8s/dcgm-exporter: manifest unknown: manifest unknown
Meaning that the image used by the deployment on the master branch does not exist or is behind specific Nvidia authorizations, so we cannot deploy and test it (should this manifest be removed if it is not usable anymore?)
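A quick way to double-check the tag outside the cluster, assuming skopeo is available, would be:
skopeo inspect docker://nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.3-ubuntu18.04
which should print the image metadata if the manifest actually existed.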
On another note, we also tried to define the same settings via the Values.yml file, and we added the nodeSelector this way:
nodeSelector:
  'nvidia.com/gpu': true
And ran it this way:
helm install gpu-helm-charts/dcgm-exporter --generate-name -f helm-dgcm-exporter.yml
But this gives another error:
Error: DaemonSet in version "v1" cannot be handled as a DaemonSet: v1.DaemonSet.Spec: v1.DaemonSetSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.NodeSelector: ReadString: expects " or n, but found t, error found in #10 byte of ...|com/gpu":true},"serv|..., bigger context ...|dOnly":true}]}],"nodeSelector":{"nvidia.com/gpu":true},"serviceAccountName":"anyuid","volumes":[{"ho|...
Which is weird, because the provided YAML with nvidia.com/gpu seems legit: there is normally no need to escape . or / in keys when they are quoted, and this key is a really popular nodeSelector for Nvidia GPUs. Any idea how this nodeSelector can be set properly?
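Our only guess so far, judging from the ReadString part of the error, is that nodeSelector values have to be plain strings, so maybe the value needs quoting as well, e.g.:
nodeSelector:
  'nvidia.com/gpu': 'true'
but we have not confirmed that this is the intended way.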
Is it possible to deploy the dcgm-exporter on an OpenShift-based Kubernetes cluster?
Which configuration can be used to prevent the error unable to set CAP_SETFCAP effective capability: Operation not permitted? Maybe we need to fix the ClusterRole to give more permissions, instead of just the ones provided by anyuid?
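To make the last question a bit more concrete, we wonder whether the intended approach is a dedicated SCC that explicitly allows the capability, something along these lines (only a sketch, untested, names and namespace are placeholders):
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: dcgm-exporter
allowPrivilegedContainer: false
allowedCapabilities:
- SETFCAP
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
fsGroup:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:<namespace>:<serviceaccount>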