Closed marceloamaral closed 5 months ago
The DCP metrics require root permissions. In your example:
./field_value_sample/field_value_sample
Start DCGM Host Engine in:
0 - Embedded mode
1 - Standalone mode
0
You need to run field_value_sample under sudo.
Thanks, @nikkon-dev, for the reply!
I am running both field_value_sample and dcgmi in a privileged container, so both were executed with root access. However, as I mentioned, I can retrieve values for the 1004 metric with dcgmi, but not with the library. Note that field_value_sample can read lower-numbered metrics, such as 155. The issue seems to be specific to metrics whose IDs start at around 1000.
Additionally, this is a Kubernetes cluster with the DCGM operator installed. I'm not sure if it might affect anything.
Different processes cannot observe DCP metrics on the same GPU - this is a hardware limitation and the reason why DCGM and profiling cannot be done simultaneously. Is there a chance your dcgmi command connects to a standalone nv-hostengine process (that may run inside the same container or in a dedicated container)?
Thank you, @nikkon-dev. It seems that the dcgmi command is connecting to a standalone nv-hostengine process on localhost.
I tried connecting field_value_sample to localhost, and it worked.
Therefore, I am closing this issue.
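For anyone hitting the same symptom, the diagnosis above depends on whether a standalone nv-hostengine is already listening inside the container or pod. Below is a minimal sketch of a check for that. It uses only POSIX sockets, not the DCGM API. The helper name tcp_port_open is hypothetical, and port 5555 is assumed to be the host engine's listening port (its default unless configured otherwise).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if something accepts TCP connections
 * on the given IPv4 address and port, 0 otherwise. */
static int tcp_port_open(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    int fd;
    int ok;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    /* Reject anything that is not a valid IPv4 literal. */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return 0;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;

    /* A successful connect means a listener exists on ip:port. */
    ok = (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);
    close(fd);
    return ok;
}
```

If tcp_port_open("127.0.0.1", 5555) returns 1, a standalone host engine is probably already running, and a client such as field_value_sample should connect to it in standalone mode rather than start an embedded engine. An open port does not prove that the listener is nv-hostengine, so treat the result as a hint only.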
I am trying the sdk_samples/c_src/field_value_sample, but changing the field 0 to get the GPU utilization. The output is:
I can get the metrics by running the dcgmi daemon, but the C++ library and the Go wrappers do not work. Note that I tested with MIG both enabled and disabled, and the library did not work in either case.