NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Error setting watches. Result: The third-party Profiling module returned an unrecoverable error #151

Closed. marceloamaral closed this issue 5 months ago

marceloamaral commented 5 months ago

I am trying the sdk_samples/c_src/field_value_sample sample, with field 0 changed so that it reports GPU utilization:

//fieldIds[0] = DCGM_FI_DEV_POWER_USAGE;
fieldIds[0] = DCGM_FI_PROF_SM_ACTIVE;

The output is:

./field_value_sample/field_value_sample
Start DCGM Host Engine in:
0 - Embedded mode
1 - Standalone mode
0

Embedded mode selected.
DCGM Initialized.
Available DCGM-Supported GPUs: 0   1   2   3   4   5   6   7
Successfully created group with group ID: 2
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error
Cleaning up.

I can get these metrics with the dcgmi command-line tool, but not through the C/C++ library or the Go wrappers.

dcgmi dmon -e 1001,1002,1003,1004 -g 35
#Entity   GRACT        SMACT        SMOCC        TENSO
ID
GPU-I 21  0.000        0.000        0.000        0.000
GPU-I 22  0.000        0.000        0.000        0.000
GPU-I 29  0.495        0.490        0.061        0.468
GPU-I 30  0.493        0.487        0.061        0.465
GPU-I 31  0.494        0.488        0.061        0.466
GPU-I 28  0.993        0.987        0.123        0.940
GPU-I 35  0.000        0.000        0.000        0.000
dcgmi dmon -e 1001,1002,1003,1004 -i 4
#Entity   GRACT        SMACT        SMOCC        TENSO
ID
GPU 4     0.571        0.571        0.071        0.544
GPU 4     0.565        0.560        0.070        0.534
GPU 4     0.565        0.560        0.070        0.534
GPU 4     0.566        0.560        0.070        0.534
GPU 4     0.566        0.561        0.070        0.535
GPU 4     0.565        0.559        0.070        0.534

Note that I tested with MIG both enabled and disabled; the library did not work in either case.

nikkon-dev commented 5 months ago

The DCP metrics require root permissions. In your example

./field_value_sample/field_value_sample
Start DCGM Host Engine in:
0 - Embedded mode
1 - Standalone mode
0

You need to run field_value_sample under sudo.

marceloamaral commented 5 months ago

Thanks, @nikkon-dev, for the reply!

I am running both field_value_sample and dcgmi in a privileged container, so both were executed with root access. However, as I mentioned, I can retrieve values for metric 1004 with dcgmi but not with the library. Note that field_value_sample can read lower-numbered metrics, such as 155. The issue seems to be specific to metrics numbered around 1000 and above.
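To make the split between the two groups of metrics concrete, here is a minimal sketch that sorts the field IDs mentioned in this thread into "works from the library" (155) and "fails from the library" (1001 to 1004). The enum and function names are hypothetical and not part of DCGM; the values are copied from the dcgmi dmon command line and from the comment above, and the range check covers only those four IDs rather than every profiling field DCGM defines.

```c
#include <stdbool.h>
#include <stddef.h>

/* Field IDs that appear in this thread. The values come from the
 * "dcgmi dmon -e 1001,1002,1003,1004" command and the mention of 155;
 * the local names are illustrative and are NOT the dcgm_fields.h macros. */
enum thread_field_id {
    FIELD_POWER_USAGE   = 155,   /* readable from field_value_sample */
    FIELD_GR_ACTIVE     = 1001,  /* GRACT column in dcgmi dmon */
    FIELD_SM_ACTIVE     = 1002,  /* SMACT column */
    FIELD_SM_OCCUPANCY  = 1003,  /* SMOCC column */
    FIELD_TENSOR_ACTIVE = 1004   /* TENSO column */
};

/* True only for the four profiling IDs seen in this thread. */
static bool is_thread_profiling_field(unsigned short id)
{
    return id >= FIELD_GR_ACTIVE && id <= FIELD_TENSOR_ACTIVE;
}

/* Count how many of the requested IDs fall in that profiling group. */
static size_t count_profiling_fields(const unsigned short *ids, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_thread_profiling_field(ids[i])) {
            count++;
        }
    }
    return count;
}
```

A check like this could be run over the fieldIds array before setting watches, to log which requested fields belong to the group that failed here.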

Additionally, this is a Kubernetes cluster with the DCGM operator installed. I'm not sure if it might affect anything.

nikkon-dev commented 5 months ago

Different processes cannot observe DCP metrics on the same GPU at the same time. This is a hardware limitation, and it is also the reason why DCGM and other profiling tools cannot be used simultaneously. Is there a chance your dcgmi command connects to a standalone nv-hostengine process (one that may run inside the same container or in a dedicated container)?
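One quick way to answer that question is to check whether anything is accepting TCP connections on the host engine's port. The following is a minimal sketch, assuming nv-hostengine is listening on 127.0.0.1 at its default TCP port 5555; it may be configured with a different port or a Unix socket, in which case this check says nothing. `tcp_port_open` is a hypothetical helper, not a DCGM call, and it only tests reachability; it does not verify that the listener is really nv-hostengine.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return true if a TCP connection to ipv4:port succeeds.
 * Returns false for an unparsable address, a socket error,
 * or a refused/failed connection. */
static bool tcp_port_open(const char *ipv4, unsigned short port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    /* inet_pton returns 1 only for a valid dotted-quad address. */
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
        return false;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return false;
    }

    bool open = connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0;
    close(fd);
    return open;
}
```

If `tcp_port_open("127.0.0.1", 5555)` returns true inside the container, a standalone host engine is probably already running there, and selecting standalone mode in field_value_sample is worth trying before embedded mode.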

marceloamaral commented 5 months ago

"Is there a chance your dcgmi command connects to a standalone..."

Thank you, @nikkon-dev. It seems that the dcgmi command is indeed connecting to a standalone nv-hostengine process on localhost.

I ran field_value_sample in standalone mode, connecting to localhost, and it worked.

Therefore, I am closing this issue.