NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

The difference between release version and open-source version for dmon command (>1000) #72

Open ligeweiwu opened 1 year ago

ligeweiwu commented 1 year ago

Hi, I have a question about the monitoring command dcgmi dmon -e 1009 (or any field ID greater than 1000). My working environment is:

+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   27C    P0    34W / 250W |    475MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

When I install the release deb package and run "dcgmi dmon -e 1010", I get the expected result. But when I build the open-source code myself and run "dcgmi dmon -e 1010", I get the error "This request is serviced by a module of DCGM that is not currently loaded". Is there any difference between the release version and the open-source version for the dmon command with field IDs greater than 1000?

DCGM version: 3.0.4

Thanks

nikkon-dev commented 1 year ago

@ligeweiwu,

The OSS DCGM version does not have the profiling module required for DCP fields (>1000) on GPUs before Hopper.

You can still use them with OSS if you copy libdcgmmoduleprofiling.so from an official DCGM package.

WBR, Nik
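The copy step suggested above could be scripted roughly as follows. This is only a sketch: the reply names the file libdcgmmoduleprofiling.so but not where it lives, so both directory arguments are hypothetical and must be supplied by the reader. It also assumes the official package matches the OSS build's version (3.0.4 in this thread), which the reply does not state explicitly.

```python
import shutil
from pathlib import Path

# File name taken from the reply above; an installed package may also ship
# versioned names or symlinks, which this sketch does not handle.
MODULE_NAME = "libdcgmmoduleprofiling.so"


def copy_profiling_module(official_lib_dir, oss_lib_dir):
    """Copy the profiling module from an official DCGM install into the
    library directory of a self-built OSS DCGM.

    Both directories are assumptions supplied by the caller; the thread
    does not say where either one is located.
    """
    src = Path(official_lib_dir) / MODULE_NAME
    if not src.is_file():
        raise FileNotFoundError(f"{MODULE_NAME} not found in {official_lib_dir}")

    dest_dir = Path(oss_lib_dir)
    if not dest_dir.is_dir():
        raise NotADirectoryError(f"{oss_lib_dir} is not a directory")

    dest = dest_dir / MODULE_NAME
    # copy2 keeps permissions and timestamps; a symlinked source is
    # followed, so the destination becomes a regular file.
    shutil.copy2(src, dest)
    return dest
```

Calling copy_profiling_module with the official package's library directory and the OSS build's library directory returns the path of the copied file. The host engine would presumably need a restart before it picks up the module.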

ligeweiwu commented 1 year ago

@nikkon-dev Hi Nik, thanks for your reply. I have another question.
I now want to test the API dcgmProfGetSupportedMetricGroups using the open-source code (3.0.4), so I ran ./dcgmi profile -l and got: Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.

So my question is: to use dcgmProfGetSupportedMetricGroups, should I also copy libdcgmmoduleprofiling.so? Does dcgmProfGetSupportedMetricGroups depend on libdcgmmoduleprofiling.so?

Thanks
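Both commands in this thread fail with the same message, so a small wrapper can check whether a given dcgmi invocation hits the "module not loaded" condition. The sketch below is illustrative only. The helper names are hypothetical, the match string is taken from the error text quoted above, and it assumes a dcgmi binary is reachable under the name passed in.

```python
import subprocess

# Substring of the error text quoted in this thread.
MODULE_NOT_LOADED = "module of DCGM that is not currently loaded"


def is_module_not_loaded(output):
    """Return True if dcgmi output contains the 'module not loaded' error."""
    return MODULE_NOT_LOADED in output


def run_dcgmi(args, dcgmi="dcgmi", timeout=30):
    """Run dcgmi with the given arguments.

    Returns (returncode, combined_output); returncode is None when the
    binary is missing or the command times out.
    """
    try:
        proc = subprocess.run(
            [dcgmi, *args], capture_output=True, text=True, timeout=timeout
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        return None, str(exc)
    return proc.returncode, proc.stdout + proc.stderr


def needs_profiling_module(args, dcgmi="dcgmi"):
    """Return True if the command failed because a DCGM module is not loaded."""
    _, output = run_dcgmi(args, dcgmi=dcgmi)
    return is_module_not_loaded(output)
```

For example, needs_profiling_module(["profile", "-l"], dcgmi="./dcgmi") would report whether the self-built binary returns this error. For dmon, adding "-c", "1" limits the run to a single sample so the call returns promptly.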