NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter doesn't start on Docker #134

Open gurapomu opened 3 years ago

gurapomu commented 3 years ago
# nvidia-docker run -d -p 9400:9400 nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu20.04
d7dcb63a3e630c4ce72e2eb13e97ae721e7d74bd407477acdbc52c4b8aac7f83
# nvidia-docker logs d7
Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2020-11-27T05:42:40Z" level=info msg="Starting dcgm-exporter"
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_S_construct null not valid
SIGABRT: abort
PC=0x7fef53b6b18b m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7fef53b6b18b
stack: frame={sp:0x7fff9d737170, fp:0x0} stack=[0x7fff9cf405b8,0x7fff9d73f5f0)
00007fff9d737070:  0000000000000000  0000000000000000
00007fff9d737080:  0000000000000000  0000000000000000
00007fff9d737090:  0000000000000000  0000000000000000
00007fff9d7370a0:  fffffffffffffff8  0000000000000001
00007fff9d7370b0:  0000000000000000  0000000000000000
00007fff9d7370c0:  0000000000000000  0000000000000030
00007fff9d7370d0:  0000000000000007  0000000000000000
00007fff9d7370e0:  0000000000000000  0000000000000000
00007fff9d7370f0:  ffffffffffffff00  ffffffffffffff00
00007fff9d737100:  0000000000000000  0000000000000000
00007fff9d737110:  0000000000000000  00007fef25e48e9d
00007fff9d737120:  4000000000000000  0000000000000000
00007fff9d737130:  0000000000000000  0000000000000000
00007fff9d737140:  0000000000000000  0000000000000000
00007fff9d737150:  00007fff9d737390  00007fff9d737398
00007fff9d737160:  74493a3a73656c69  3a3a726f74617265
00007fff9d737170: <0000000000000000  0000000000000000
00007fff9d737180:  0000000020202020  0000000000000000
00007fff9d737190:  0000000100000000  0000000000000000
00007fff9d7371a0:  0000000000000000  0000000000000000
00007fff9d7371b0:  0000000000000000  000000000000000a
00007fff9d7371c0:  0000000000000000  0000000000ec0cf0
00007fff9d7371d0:  0000000000000004  0000000000000004
00007fff9d7371e0:  0000000000000000  0000000000000000
00007fff9d7371f0:  fffffffe7fffffff  ffffffffffffffff
00007fff9d737200:  ffffffffffffffff  ffffffffffffffff
00007fff9d737210:  ffffffffffffffff  ffffffffffffffff
00007fff9d737220:  ffffffffffffffff  ffffffffffffffff
00007fff9d737230:  ffffffffffffffff  ffffffffffffffff
00007fff9d737240:  ffffffffffffffff  ffffffffffffffff
00007fff9d737250:  ffffffffffffffff  ffffffffffffffff
00007fff9d737260:  ffffffffffffffff  ffffffffffffffff
runtime: unknown pc 0x7fef53b6b18b
stack: frame={sp:0x7fff9d737170, fp:0x0} stack=[0x7fff9cf405b8,0x7fff9d73f5f0)
00007fff9d737070:  0000000000000000  0000000000000000
00007fff9d737080:  0000000000000000  0000000000000000
00007fff9d737090:  0000000000000000  0000000000000000
00007fff9d7370a0:  fffffffffffffff8  0000000000000001
00007fff9d7370b0:  0000000000000000  0000000000000000
00007fff9d7370c0:  0000000000000000  0000000000000030
00007fff9d7370d0:  0000000000000007  0000000000000000
00007fff9d7370e0:  0000000000000000  0000000000000000
00007fff9d7370f0:  ffffffffffffff00  ffffffffffffff00
00007fff9d737100:  0000000000000000  0000000000000000
00007fff9d737110:  0000000000000000  00007fef25e48e9d
00007fff9d737120:  4000000000000000  0000000000000000
00007fff9d737130:  0000000000000000  0000000000000000
00007fff9d737140:  0000000000000000  0000000000000000
00007fff9d737150:  00007fff9d737390  00007fff9d737398
00007fff9d737160:  74493a3a73656c69  3a3a726f74617265
00007fff9d737170: <0000000000000000  0000000000000000
00007fff9d737180:  0000000020202020  0000000000000000
00007fff9d737190:  0000000100000000  0000000000000000
00007fff9d7371a0:  0000000000000000  0000000000000000
00007fff9d7371b0:  0000000000000000  000000000000000a
00007fff9d7371c0:  0000000000000000  0000000000ec0cf0
00007fff9d7371d0:  0000000000000004  0000000000000004
00007fff9d7371e0:  0000000000000000  0000000000000000
00007fff9d7371f0:  fffffffe7fffffff  ffffffffffffffff
00007fff9d737200:  ffffffffffffffff  ffffffffffffffff
00007fff9d737210:  ffffffffffffffff  ffffffffffffffff
00007fff9d737220:  ffffffffffffffff  ffffffffffffffff
00007fff9d737230:  ffffffffffffffff  ffffffffffffffff
00007fff9d737240:  ffffffffffffffff  ffffffffffffffff
00007fff9d737250:  ffffffffffffffff  ffffffffffffffff
00007fff9d737260:  ffffffffffffffff  ffffffffffffffff

goroutine 1 [syscall]:
runtime.cgocall(0x91b600, 0xc00041faf0, 0xc000000001)
        /usr/local/go/src/runtime/cgocall.go:133 +0x5b fp=0xc00041fac0 sp=0xc00041fa88 pc=0x4068bb
github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm._Cfunc_dcgmStartEmbedded(0x1, 0xc000029458, 0x0)
        _cgo_gotypes.go:948 +0x4d fp=0xc00041faf0 sp=0xc00041fac0 pc=0x50759d
github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm.startEmbedded(0x170c820, 0x101)
        /go/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm/admin.go:106 +0xcb fp=0xc00041fb50 sp=0xc00041faf0 pc=0x507feb
github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm.initDcgm(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm/admin.go:75 +0x188 fp=0xc00041fbc0 sp=0xc00041fb50 pc=0x507d78
github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm.Init(0x0, 0x0, 0x0, 0x0, 0x1, 0x8, 0xc00041fca8)
        /go/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm/api.go:27 +0xf6 fp=0xc00041fc28 sp=0xc00041fbc0 pc=0x5067f6
main.Run(0xc0002debc0, 0x0, 0x0)
        /go/src/github.com/NVIDIA/gpu-monitoring-tools/pkg/main.go:99 +0xce fp=0xc00041fd78 sp=0xc00041fc28 pc=0x91759e
main.main.func1(0xc0002debc0, 0xc00000f300, 0xa)
        /go/src/github.com/NVIDIA/gpu-monitoring-tools/pkg/main.go:85 +0x2b fp=0xc00041fda0 sp=0xc00041fd78 pc=0x91a77b
github.com/urfave/cli/v2.(*App).RunContext(0xc000001980, 0xb20560, 0xc000028050, 0xc0000201d0, 0x1, 0x1, 0x0, 0x0)
        /go/src/github.com/NVIDIA/gpu-monitoring-tools/vendor/github.com/urfave/cli/v2/app.go:315 +0x70b fp=0xc00041fec8 sp=0xc00041fda0 pc=0x8efa1b
github.com/urfave/cli/v2.(*App).Run(...)
        /go/src/github.com/NVIDIA/gpu-monitoring-tools/vendor/github.com/urfave/cli/v2/app.go:215
main.main()
        /go/src/github.com/NVIDIA/gpu-monitoring-tools/pkg/main.go:88 +0x73f fp=0xc00041ff88 sp=0xc00041fec8 pc=0x91721f
runtime.main()
        /usr/local/go/src/runtime/proc.go:203 +0x212 fp=0xc00041ffe0 sp=0xc00041ff88 pc=0x438d42
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00041ffe8 sp=0xc00041ffe0 pc=0x4663a1

rax    0x0
rbx    0x7fef53b22740
rcx    0x7fef53b6b18b
rdx    0x0
rdi    0x2
rsi    0x7fff9d737170
rbp    0x17120d8
rsp    0x7fff9d737170
r8     0x0
r9     0x7fff9d737170
r10    0x8
r11    0x246
r12    0x173da60
r13    0x0
r14    0x7fef25faf260
r15    0x0
rip    0x7fef53b6b18b
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

More info:

# nvidia-smi
Fri Nov 27 14:53:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   25C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   26C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   27C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                   On |
| N/A   29C    P0    41W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                   On |
| N/A   28C    P0    43W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                   On |
| N/A   28C    P0    42W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                   On |
| N/A   29C    P0    42W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# nvidia-docker version
NVIDIA Docker: 2.5.0
Client: Docker Engine - Community
 Version:           19.03.13
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46d9d
 Built:             Wed Sep 16 17:02:52 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:01:20 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.3.7
  GitCommit:        8fba4e9a7d01810a393d5d25a3621dc101981175
 nvidia:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

# cat /etc/dgx-release
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2020-10-26-11-53-11"
DGX_SWBUILD_VERSION="5.0.0"
DGX_COMMIT_ID="7501dff"
DGX_PLATFORM="DGX Server for DGX A100"

However, it works with version 1.7.2. What are the breaking changes between 1.7.2 and the latest release?
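
In case it helps narrow this down: GPUs 4-7 on this node are in MIG mode (see the MIG M. column in the nvidia-smi output above). A quick way to list MIG mode per GPU, assuming a recent nvidia-smi that supports these query fields:

# nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv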

dualvtable commented 3 years ago

Thanks for reporting this issue - we made some significant architectural changes between 1.7.2 and 2.x.y. The current version of dcgm-exporter doesn't support GPUs in MIG mode yet, so this is a legitimate bug that we need to investigate.
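
Until MIG support lands, one possible workaround is to keep the MIG-enabled GPUs out of the exporter's view by limiting NVIDIA_VISIBLE_DEVICES to the non-MIG GPUs. A sketch for the node above (device indices 0-3 are the GPUs with MIG disabled; adjust for your system):

# nvidia-docker run -d -p 9400:9400 -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu20.04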

andiariffin commented 3 years ago

Hi, I'm hitting the same issue; it looks like it only happens when a MIG-enabled GPU is exposed to the container:

nvidia@esc4000:~$ nvidia-smi 
Wed Jan 27 15:25:16 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:01:00.0 Off |                    0 |
| N/A   31C    P0    45W / 250W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      On   | 00000000:41:00.0 Off |                    0 |
| N/A   31C    P0    45W / 250W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-PCIE-40GB      On   | 00000000:81:00.0 Off |                   On |
| N/A   27C    P0    31W / 250W |     25MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-PCIE-40GB      On   | 00000000:C1:00.0 Off |                   On |
| N/A   28C    P0    33W / 250W |     25MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
nvidia@esc4000:~$ sudo docker run -e NVIDIA_VISIBLE_DEVICES=0,1 --rm nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2021-01-27T07:27:55Z" level=info msg="Starting dcgm-exporter"
time="2021-01-27T07:27:55Z" level=info msg="DCGM successfully initialized!"
time="2021-01-27T07:27:55Z" level=info msg="Collecting DCP Metrics"
time="2021-01-27T07:27:55Z" level=info msg="Pipeline starting"
time="2021-01-27T07:27:55Z" level=info msg="Starting webserver"
nvidia@esc4000:~$ sudo docker run -e NVIDIA_VISIBLE_DEVICES=2,3 --rm nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2021-01-27T07:28:12Z" level=info msg="Starting dcgm-exporter"
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_S_construct null not valid
SIGABRT: abort
PC=0x7f6fa98bd18b m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7f6fa98bd18b
stack: frame={sp:0x7fff873a4d80, fp:0x0} stack=[0x7fff833ae4b8,0x7fff873ad4f0)
00007fff873a4c80:  0000000000000000  0000000000000000 
00007fff873a4c90:  0000000000000000  0000000000000000
...
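
So the crash only reproduces when a MIG-enabled GPU (index 2 or 3 here) is visible to the exporter. For anyone selecting GPUs with the native --gpus flag instead of NVIDIA_VISIBLE_DEVICES, the equivalent selection (Docker 19.03+ with the NVIDIA container toolkit) should be something like:

nvidia@esc4000:~$ sudo docker run --rm --gpus '"device=0,1"' nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04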
gurapomu commented 3 years ago

@dualvtable Was this problem resolved in 2.4.0-rc.2? I found this documentation page: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html#multi-instance-gpu-mig-support
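
If MIG support has landed there, I would expect an invocation along these lines to work on this node (the image tag below is a placeholder for the matching 2.4.0-rc.2 image, and --cap-add SYS_ADMIN is per the earlier warning about profiling metrics):

# nvidia-docker run -d --cap-add SYS_ADMIN -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<2.4.0-rc.2 tag>
# curl -s localhost:9400/metrics | head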