NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Running a new dcgm wrapper by using the dcgmInit() #104

Closed vinayburugu closed 11 months ago

vinayburugu commented 1 year ago

Is this a valid usage? I have already enabled DCGM with sudo systemctl --now enable nvidia-dcgm, which lets me use the dcgmi CLI. Now I want to create a dcgmwrapper binary that uses the C APIs to initialize DCGM, enable policies, and register callback functions so that I can handle policy failures, as below:

// full code not included
result = dcgmInit();
conditionBuffer |= DCGM_POLICY_COND_XID;
myGroupPolicy.parms[6].tag         = dcgmPolicyConditionParams_t::BOOL;
myGroupPolicy.parms[6].val.boolean = true;
myGroupPolicy.condition            = (dcgmPolicyCondition_t)conditionBuffer;
result = dcgmPolicyRegister(
    dcgmHandle, myGroupId, myGroupPolicy.condition, violationRegistrationCallback, violationRegistrationCallback);

Effectively, I have two dcgm processes running. Is this a valid usage of DCGM?

cc: @nikkon-dev
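
For readers following the snippet above: it pairs the DCGM_POLICY_COND_XID bit with parms[6], which suggests that each condition's parameter slot sits at the bit position of its condition flag. The sketch below illustrates that mapping under stated assumptions. The enum values are local stand-ins for the dcgmPolicyCondition_t values as recalled from dcgm_structs.h, and parmsIndexFor is a hypothetical helper, not part of the DCGM API. Real code should include the DCGM header and verify the values against the installed version.

```cpp
#include <cstdint>

// ASSUMPTION: local stand-ins mirroring the dcgmPolicyCondition_t bit values
// as recalled from dcgm_structs.h. Verify against your installed header and
// use the real DCGM_POLICY_COND_* constants in production code.
enum PolicyCond : std::uint32_t {
    COND_DBE               = 0x1,
    COND_PCI               = 0x2,
    COND_MAX_PAGES_RETIRED = 0x4,
    COND_THERMAL           = 0x8,
    COND_POWER             = 0x10,
    COND_NVLINK            = 0x20,
    COND_XID               = 0x40,
};

// Hypothetical helper (not a DCGM function): returns the parms[] index that
// corresponds to a single-bit condition, i.e. the position of its set bit.
// Returns -1 when the input is zero or has more than one bit set.
int parmsIndexFor(std::uint32_t cond) {
    if (cond == 0 || (cond & (cond - 1)) != 0) {
        return -1; // not exactly one condition bit
    }
    int index = 0;
    while ((cond & 1u) == 0) {
        cond >>= 1;
        ++index;
    }
    return index;
}

// Combines several condition bits into one mask, the same way the snippet
// builds conditionBuffer with |=.
std::uint32_t combineConditions(std::uint32_t a, std::uint32_t b) {
    return a | b;
}
```

Under these assumptions, parmsIndexFor(COND_XID) yields 6, which matches the parms[6] used in the snippet, and a combined mask such as XID plus thermal would need both parms[6] and parms[3] filled in before the policy is set.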

glowkey commented 1 year ago

Are you running 2 nv-hostengine processes? The dcgmwrapper can utilize the already-running dcgm/nv-hostengine instead of starting a 2nd hostengine process.

vinayburugu commented 1 year ago

Yes, I will effectively be running two nv-hostengine processes. I was planning to run the custom binary in embedded mode. Is that not a valid use case, @glowkey? Does running DCGM in embedded mode launch its own nv-hostengine?

glowkey commented 1 year ago

Running 2 nv-hostengine processes within the same OS instance is not advised.

vinayburugu commented 1 year ago

> Are you running 2 nv-hostengine processes? The dcgmwrapper can utilize the already-running dcgm/nv-hostengine instead of starting a 2nd hostengine process.

So, if I reuse the already-running nv-hostengine process, can I have any number of custom dcgmwrapper binaries, with each wrapper instance having its own set of policies?

glowkey commented 1 year ago

Yes, that is normal usage.

vinayburugu commented 1 year ago
  1. Thanks for the reply, @glowkey. Just to confirm my understanding: do you recommend using standalone mode, connecting to the hostengine on localhost with the dcgmConnect(hostIpAddress, &dcgmHandle) API in the dcgmwrapper binaries? For example: https://github.com/NVIDIA/DCGM/blob/master/sdk_samples/c_src/policy_sample/policy_sample.cpp#L110

  2. I presume sudo systemctl --now enable nvidia-dcgm starts an nv-hostengine process, doesn't it? Can you please confirm?

  3. What is the resource utilization impact (CPU and GPU) of running a dcgm binary in the background that reuses the existing nv-hostengine? For example, (i) for policy and callback registration, and (ii) for job stats?

cc: @nikkon-dev