NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

DCGM Exporter in EKS p4d.24xlarge instance type controller error #387

Open camilopaezrios opened 2 months ago

What is the version?

3.4.2.

What happened?

I have an EKS cluster that runs heavy GPU workloads, and I want to integrate monitoring with Datadog. Deploying the DCGM exporter fails in my prod environment (multiple p4d.24xlarge nodes) but works in my dev environment (a single p3.2xlarge, to save cost) with the same AMI: AL2_X86_64_GPU - amazon-eks-gpu-node-1.29-v20240729. The error I am getting is:

level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:276 +0x3d\npanic({0x18058a0?, 0x2945390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc0005831e0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:516 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc000321360)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:296 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:280 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cf3418?, 0xc00002e5a0}, 0xc00044db70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0002a0380)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:271 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc0000274a0?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:256 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc0000274a0, 0xc0002a0380, {0xc000040150, 0x3, 0x3})\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc000057400, {0x1cf3300?, 0x2a0c420}, {0xc000040150, 0x3, 0x3})\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc00044df20?, {0xc000040150?, 0x1?, 0x163cbb0?})\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"
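
The stacktrace field is logged with escaped `\n` and `\t`, which makes it hard to read. A minimal Python sketch for expanding it is shown here, assuming the log line has been copied verbatim into a string. The helper name `expand_stacktrace` and the shortened `sample` line are illustrative only and are not part of dcgm-exporter. Once expanded, the trace above shows the panic being raised from `initDCGM` at `app.go:516`.

```python
import re


def expand_stacktrace(log_line: str) -> str:
    """Return the stacktrace field with literal \\n and \\t turned into real whitespace."""
    # Greedy match so the capture runs to the last double quote on the line.
    match = re.search(r'stacktrace="(.*)"', log_line, flags=re.DOTALL)
    if match is None:
        return ""
    raw = match.group(1)
    return raw.replace("\\n", "\n").replace("\\t", "\t")


# Shortened copy of the log line from this report (first frame only).
sample = (
    'level=error msg="Encountered a failure." '
    'stacktrace="goroutine 1 [running]:\\nruntime/debug.Stack()\\n'
    '\\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\\n"'
)

print(expand_stacktrace(sample))
```

Running the same function over the full log line gives one frame per pair of lines (function call, then source location), in the usual Go panic layout.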

The installation was done via Helm, following this document: https://docs.datadoghq.com/integrations/dcgm/?tab=kubernetes. I am using version 3.4.2 rather than latest because the latest version triggers an error: https://github.com/NVIDIA/dcgm-exporter/issues/318

The variables DCGM_FI_DEV_COUNT, DCGM_FI_PROCESS_NAME, and DCGM_FI_CUDA_DRIVER_VERSION were commented out so they are not reported, because reporting them triggers an error: https://github.com/NVIDIA/dcgm-exporter/issues/318

What did you expect to happen?

Agent running properly

What is the GPU model?

p4d.24xlarge

What is the environment?

AWS EKS

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response