NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

How do you install datacanter-gpu-manager-config package? #121

Open nguoido opened 10 months ago

nguoido commented 10 months ago

2023-10-26 14:44:40.652 ERROR [12179:12179] Could not read package diag config. Please ensure the datacanter-gpu-manager-config package is installed [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/ConfigFileParser_v2.cpp:218] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-10-26 14:44:40.652 ERROR [12179:12179] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]

I get this error when I run dcgmi diag -r 4.

nikkon-dev commented 10 months ago

@nguoido,

The missing file, diag-skus.yaml, is part of the regular datacenter-gpu-manager package and should be installed along with it. The datacenter-gpu-manager-config package is only created and released when there are changes to the SKU definitions that we need to make available before the next DCGM release; under normal circumstances it does not exist. Please verify the integrity of your datacenter-gpu-manager package.
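One way to carry out that integrity check is sketched below. It is not an official DCGM tool. The path comes from the error log and the package name from the reply above. The helper names (`file_is_readable`, `verify_command`) are hypothetical. The sketch assumes the package was installed through dpkg or rpm, and that `dpkg --verify` or `rpm -V` is available on the host.

```python
import os
import shutil
import subprocess

# Path taken from the error log; package name taken from the reply above.
DIAG_SKUS = "/usr/share/nvidia-validation-suite/diag-skus.yaml"
PACKAGE = "datacenter-gpu-manager"


def file_is_readable(path):
    """Return True if path is a regular file the current user can read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


def verify_command(package, which=shutil.which):
    """Build a package-verification command for whichever manager is present.

    Assumption: dpkg is preferred when both dpkg and rpm are installed.
    Returns None if neither tool is found on PATH.
    """
    if which("dpkg"):
        return ["dpkg", "--verify", package]
    if which("rpm"):
        return ["rpm", "-V", package]
    return None


def main():
    if file_is_readable(DIAG_SKUS):
        print(f"found: {DIAG_SKUS}")
    else:
        print(f"missing or unreadable: {DIAG_SKUS}")

    cmd = verify_command(PACKAGE)
    if cmd is None:
        print("no dpkg or rpm found; check the package manually")
        return

    # Both tools print nothing when the installed files match the package.
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr or "no differences reported")


if __name__ == "__main__":
    main()
```

If the file is reported missing, or the verification lists changed or missing files, reinstalling the datacenter-gpu-manager package is the likely next step.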