NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

dcgm-exporter is not working on ec2 g5.48xlarge nodes #313

Open eselyavka opened 4 months ago

eselyavka commented 4 months ago

What is the version?

3.3.5-3.4.1-ubi9

What happened?

We are running dcgm-exporter under containerd, and on g5.48xlarge nodes dcgm-exporter struggles to come online, failing with this error:

[root@test-machine ~]# ctr run --env DCGM_EXPORTER_DEBUG=true --cap-add CAP_SYS_ADMIN --rm nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9 dcgm-exporter
2024/04/10 21:53:33 maxprocs: Leaving GOMAXPROCS=192: CPU quota undefined
time="2024-04-10T21:53:33Z" level=info msg="Starting dcgm-exporter"
time="2024-04-10T21:53:33Z" level=debug msg="Debug output is enabled"
time="2024-04-10T21:53:33Z" level=debug msg="Command line: /usr/bin/dcgm-exporter"
time="2024-04-10T21:53:33Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/default-counters.csv Address::9400 CollectInterval:30000 Kubernetes:false KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock}"
Error: Failed to initialize NVML
time="2024-04-10T21:53:33Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:269 +0x3d\npanic({0x17dbac0?, 0x28fb390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc00026d1e0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:509 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc00067a960)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:289 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:273 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cbda38?, 0xc000638550}, 0xc00049fb70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc000536a00)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:264 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc00062d080?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:249 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc00062d080, 0xc000536a00, {0xc0002a6050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc00034d200, {0x1cbd920?, 0x29c12a0}, {0xc0002a6050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc00049ff20?, {0xc0002a6050?, 0x1?, 0x1616700?})\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"

What did you expect to happen?

dcgm-exporter running without any issues, with log output like the following (taken from a g5.2xlarge EC2 node where dcgm-exporter runs successfully):

2024/04/11 11:06:15 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-04-11T11:06:15Z" level=info msg="Starting dcgm-exporter"
time="2024-04-11T11:06:15Z" level=info msg="DCGM successfully initialized!"
time="2024-04-11T11:06:15Z" level=info msg="Collecting DCP Metrics"
time="2024-04-11T11:06:15Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-04-11T11:06:15Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-11T11:06:15Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-04-11T11:06:15Z" level=info msg="Starting webserver"
time="2024-04-11T11:06:15Z" level=info msg="Pipeline starting"
time="2024-04-11T11:06:15Z" level=info msg="Listening on" address="[::]:9400"
time="2024-04-11T11:06:15Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false

What is the GPU model?

[root@test-machine ~]# nvidia-smi
Wed Apr 10 21:55:27 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:16.0 N/A |                  N/A |
|ERR!  ERR! ERR!               N/A /  N/A |      0MiB / 23028MiB |     N/A      Default |
|                                         |                      |                 ERR! |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:17.0 Off |                    0 |
|  0%   28C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:18.0 Off |                    0 |
|  0%   27C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:19.0 Off |                    0 |
|  0%   27C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A10G                    On  | 00000000:00:1A.0 Off |                    0 |
|  0%   27C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   28C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   27C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   28C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[root@test-machine ~]#

What is the environment?

We are spinning up a Kubernetes cluster with the kops utility and running dcgm-exporter as a DaemonSet on GPU EC2 instances. The containerd version is:

[root@test-machine ~]# containerd --version
containerd github.com/containerd/containerd v1.6.21 3dce8eb055cbb6872793272b4f20ed16117344f8
[root@test-machine ~]#

The kubelet version is:

[root@test-machine ~]# kubelet --version
Kubernetes v1.24.13

How did you deploy the dcgm-exporter and what is the configuration?

We are deploying dcgm-exporter as a Helm chart via Argo CD.

How to reproduce the issue?

Try to run dcgm-exporter under containerd on a g5.48xlarge EC2 instance using the OSS DLAMI image:

ctr image pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9
ctr run --env DCGM_EXPORTER_DEBUG=true --cap-add CAP_SYS_ADMIN --rm nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9 dcgm-exporter

Anything else we need to know?

No response

nvvfedorov commented 4 months ago

@eselyavka, please make sure that containerd is configured to use the NVIDIA runtime: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.
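
For reference, the containerd configuration from that guide typically ends up looking roughly like this (a sketch only, assuming the default config path /etc/containerd/config.toml and the default toolkit install location; exact plugin section names depend on the containerd config version):

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Make the NVIDIA runtime the default so containers get the driver libraries injected
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    # Path installed by the nvidia-container-toolkit package (assumed default)
    BinaryName = "/usr/bin/nvidia-container-runtime"

followed by a restart of containerd (systemctl restart containerd).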

eselyavka commented 4 months ago

@nvvfedorov maybe my runtime is pretty old on those DLAMI images:

[root@test-machine ~]# yum info nvidia-container-toolkit
Loaded plugins: dkms-build-requires, extras_suggestions, kernel-livepatch, langpacks, priorities, update-motd, versionlock
Installed Packages
Name        : nvidia-container-toolkit
Arch        : x86_64
Version     : 1.13.5
Release     : 1
Size        : 2.3 M
Repo        : installed
From repo   : libnvidia-container
Summary     : NVIDIA Container Toolkit
URL         : https://github.com/NVIDIA/nvidia-container-toolkit
License     : Apache-2.0
Description : Provides tools and utilities to enable GPU support in containers.

I do not see any option in nvidia-ctk to configure the runtime for containerd:

[root@test-machine ~]# nvidia-ctk runtime configure --help
NAME:
   NVIDIA Container Toolkit CLI runtime configure - Add a runtime to the specified container engine

USAGE:
   NVIDIA Container Toolkit CLI runtime configure [command options] [arguments...]

OPTIONS:
   --dry-run                    update the runtime configuration as required but don't write changes to disk (default: false)
   --runtime value              the target runtime engine. One of [crio, docker] (default: "docker")
   --config value               path to the config file for the target runtime
   --nvidia-runtime-name value  specify the name of the NVIDIA runtime that will be added (default: "nvidia")
   --runtime-path value         specify the path to the NVIDIA runtime executable (default: "nvidia-container-runtime")
   --set-as-default             set the specified runtime as the default runtime (default: false)
   --help, -h                   show help (default: false)

As you can see, --runtime accepts only [crio, docker]; there is no containerd option.

I guess I have to try updating the toolkit to the latest version.
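
(For reference, recent nvidia-container-toolkit releases document containerd as a supported value for --runtime, so after upgrading, the configuration step should look roughly like the following; /etc/containerd/config.toml is assumed as the config file it edits:)

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd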

nvvfedorov commented 4 months ago

@eselyavka, the DCGM exporter depends on the NVIDIA container runtime; please try updating the runtime configuration.
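
As a quick check outside of Kubernetes, one option that has worked in similar setups is to point ctr at the NVIDIA runtime binary explicitly (a sketch only: --runc-binary availability depends on the ctr/containerd version, and /usr/bin/nvidia-container-runtime assumes the default toolkit install path):

ctr run --rm \
  --runc-binary=/usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  --env DCGM_EXPORTER_DEBUG=true \
  --cap-add CAP_SYS_ADMIN \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9 dcgm-exporter

If NVML initializes with this invocation but not with the original one, the problem is in the runtime wiring on the node rather than in dcgm-exporter itself.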