gurapomu opened 3 years ago
Thanks for reporting this issue - we made significant architecture changes between 1.7.2 and 2.x.y. The current version of dcgm-exporter
doesn't support GPUs in MIG mode yet, so this is a legitimate bug that we need to investigate.
Hi, I got the same issue; it looks like it only happens when there is a MIG-enabled GPU:
nvidia@esc4000:~$ nvidia-smi
Wed Jan 27 15:25:16 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:01:00.0 Off | 0 |
| N/A 31C P0 45W / 250W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-PCIE-40GB On | 00000000:41:00.0 Off | 0 |
| N/A 31C P0 45W / 250W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-PCIE-40GB On | 00000000:81:00.0 Off | On |
| N/A 27C P0 31W / 250W | 25MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-PCIE-40GB On | 00000000:C1:00.0 Off | On |
| N/A 28C P0 33W / 250W | 25MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
nvidia@esc4000:~$ sudo docker run -e NVIDIA_VISIBLE_DEVICES=0,1 --rm nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2021-01-27T07:27:55Z" level=info msg="Starting dcgm-exporter"
time="2021-01-27T07:27:55Z" level=info msg="DCGM successfully initialized!"
time="2021-01-27T07:27:55Z" level=info msg="Collecting DCP Metrics"
time="2021-01-27T07:27:55Z" level=info msg="Pipeline starting"
time="2021-01-27T07:27:55Z" level=info msg="Starting webserver"
nvidia@esc4000:~$ sudo docker run -e NVIDIA_VISIBLE_DEVICES=2,3 --rm nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2021-01-27T07:28:12Z" level=info msg="Starting dcgm-exporter"
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
SIGABRT: abort
PC=0x7f6fa98bd18b m=0 sigcode=18446744073709551610
goroutine 0 [idle]:
runtime: unknown pc 0x7f6fa98bd18b
stack: frame={sp:0x7fff873a4d80, fp:0x0} stack=[0x7fff833ae4b8,0x7fff873ad4f0)
00007fff873a4c80: 0000000000000000 0000000000000000
00007fff873a4c90: 0000000000000000 0000000000000000
...
@dualvtable Did you resolve this problem in 2.4.0-rc.2? I found this document: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html#multi-instance-gpu-mig-support
But it works in version 1.7.2. What are the breaking changes between 1.7.2 and the latest version?