Closed — eugenberend closed this issue 3 years ago
hi @eugenberend - this is correct. For instance, the NVIDIA GPU Operator uses node labels to identify which nodes the dcgm-exporter daemonset should be enabled on. You can either use the GPU Operator directly or apply node selectors to control which nodes dcgm-exporter runs on (preferably the nodes with GPUs).
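As a rough sketch of the node-selector approach (assuming the chart exposes a standard `nodeSelector` value, and that your GPU nodes carry a label such as `nvidia.com/gpu.present`, which GPU Feature Discovery / the GPU Operator typically applies — substitute whatever label your nodes actually have), the custom Helm values might look like:

```yaml
# custom-values.yaml -- hypothetical sketch, not verified against your chart version.
# Constrain the dcgm-exporter daemonset to nodes labeled as having a GPU.
nodeSelector:
  nvidia.com/gpu.present: "true"

# Alternatively, label the GPU nodes yourself and select on that label:
#   kubectl label node <gpu-node-name> gpu=true
# nodeSelector:
#   gpu: "true"
```

You would pass the file with `helm install ... -f custom-values.yaml` as usual; the key point is simply that the daemonset gets scheduled only onto nodes that actually have GPUs.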
We will also look into adding better error handling to dcgm-exporter to deal with this scenario.
I want to export metrics from a dedicated, external VM using dcgm-exporter in k8s. When I apply the latest Helm chart, the following errors are logged:
so the pods end up in the CrashLoopBackOff state.
Here's how my custom values YAML file looks:
My k8s cluster runs on nodes without GPUs. I think that instead of getting information from the remote (non-k8s) VM, the dcgm-exporter pod tries to enumerate the devices of the k8s node it is running on.
For now, my suggestion is that the k8s version of dcgm-exporter should only be run on k8s clusters with GPUs. Is my suggestion correct?