NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs

Error setting up dcgm with startHostEngine mode from a golang based container #168

Closed by haardm 4 months ago

haardm commented 4 months ago

I am building a Go-based monitoring agent; the Docker image for the agent also installs DCGM. The Go application initializes the DCGM client in StartHostEngine mode.

The agent image runs in a Kubernetes pod as a DaemonSet. Inside the pod, I get the following error:
    error connecting to nv-hostengine: Host engine connection invalid/disconnected

Earlier, I ran the DCGM image nvcr.io/nvidia/cloud-native/dcgm:3.3.5-1-ubuntu22.04 in a separate container on the node and connected to it in Standalone mode, which worked fine.

I was also able to run the agent successfully in Embedded mode, which eliminates the separate DCGM server container. However, this broke my ability to SSH into the EC2 instance and run dcgmi test --inject commands to test error scenarios.

  1. Is there a way to run dcgmi test with Embedded mode that would work for my setup? I also tried opening a shell inside the monitoring-agent Kubernetes pod, but that does not work either, and I get the following error:
    sh-4.2$ dcgmi test --inject --gpuid 0 -f 202 -v 99999
    Error: unable to establish a connection to the specified host: localhost
    Error: Unable to connect to host engine. Host engine connection invalid/disconnected.

    Just FYI, in this setup I do not get any errors from dcgm.Init(dcgm.Embedded).

  2. I switched to dcgm.Init(dcgm.StartHostEngine), since StartHostEngine is the mode that starts nv-hostengine. I hoped this would eliminate the server container and still let me test with dcgmi, but I am currently getting init errors:
    Error connecting to nv-hostengine: Host engine connection invalid/disconnected
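
One thing that can be checked before calling dcgm.Init(dcgm.StartHostEngine) is whether the nv-hostengine binary is visible inside the agent container at all. The sketch below assumes that StartHostEngine mode launches nv-hostengine from the container's PATH. The issue only states that this mode starts nv-hostengine, so how the binary is located is an assumption, and findHostEngine is a hypothetical helper name. A missing binary is only one possible cause of the init error.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findHostEngine looks up the named binary on PATH and returns where it was found.
// (Hypothetical helper, not part of go-dcgm.)
func findHostEngine(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found on PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	// Assumption: StartHostEngine mode needs this binary inside the same container.
	path, err := findHostEngine("nv-hostengine")
	if err != nil {
		fmt.Println("StartHostEngine mode is unlikely to work:", err)
		return
	}
	fmt.Println("found", path)
}
```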

haardm commented 4 months ago

Wrong repo. The correct repo is https://github.com/NVIDIA/go-dcgm/issues/66