NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter crashes while getting device cpu affinity #184

Closed eugenberend closed 3 years ago

eugenberend commented 3 years ago

I want to export metrics from a dedicated, external VM, using dcgm-exporter in k8s. When I apply the latest Helm chart, errors are logged:

time="2021-05-08T13:03:29Z" level=info msg="Starting dcgm-exporter"
time="2021-05-08T13:03:29Z" level=info msg="Attemping to connect to remote hostengine at my-gpu-vm:5555"
time="2021-05-08T13:03:29Z" level=info msg="DCGM successfully initialized!"
time="2021-05-08T13:03:29Z" level=info msg="Collecting DCP Metrics"
time="2021-05-08T13:03:29Z" level=fatal msg="Error getting device cpu affinity: open /sys/bus/pci/devices/0000:8b:00.0/local_cpulist: no such file or directory"

so the pods are in CrashLoopBackOff state.

Here's how my custom values yaml file looks:

arguments:
  - "-f"
  - "/etc/dcgm-exporter/dcp-metrics-included.csv"
  - "-r"
  - "my-gpu-vm:5555"

My k8s cluster runs on nodes without GPUs. I think that instead of getting device information from the remote (non-k8s) VM, the dcgm-exporter pod tries to enumerate devices on the k8s node where the pod itself is running.

For now, I suggest that the k8s version of dcgm-exporter should be run only on k8s clusters with GPUs. Is my suggestion correct?

dualvtable commented 3 years ago

hi @eugenberend - this is correct. For instance, the NVIDIA GPU Operator uses labels to identify which nodes the dcgm-exporter daemonset should be enabled on. You can either use the GPU Operator directly or apply node selectors to determine which nodes dcgm-exporter should run on (preferably the nodes with GPUs).
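As a minimal sketch of the node-selector approach, a values override like the following could pin the daemonset to GPU nodes. The label key used here is an assumption (it is the one applied by NVIDIA's GPU Feature Discovery / GPU Operator); substitute whatever label your GPU nodes actually carry, or apply one manually with `kubectl label node`:

# Hypothetical values override for the dcgm-exporter Helm chart.
# Assumes GPU nodes are labeled nvidia.com/gpu.present="true"
# (set by GPU Feature Discovery); adjust the key/value to match
# the labels present on your cluster.
nodeSelector:
  nvidia.com/gpu.present: "true"

With this in place, the scheduler only places dcgm-exporter pods on matching nodes, so pods never land on GPU-less nodes where the /sys/bus/pci device files are missing.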

We will look into adding some better error handling in dcgm-exporter to deal with this scenario as well.