NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Questions about field identifiers #48

Open cyLi-Tiger opened 1 year ago

cyLi-Tiger commented 1 year ago

I have 3 questions about DCGM. I noticed that there are field identifiers like memory utilization and GPU utilization.

  1. How are these metrics calculated?
  2. What if I want to monitor the running status of each GPU core? Is there any API for that?
  3. Is there any way to monitor the highest GPU memory use during a time period? So far I only have a naive plan: collect multiple records over the period and return the maximum among them.
nikkon-dev commented 1 year ago

1. How are these metrics calculated? In general, utilization is based on how long the SMs were busy vs. non-busy during the polling interval. Occupancy is based on how many SMs were active vs. how many were idle.

2. What if I want to monitor the running status of each GPU core? Is there any API for that? There is no such API. This information is not exposed by the hardware; we get already-aggregated metric values from the underlying hardware.

3. Is there any way to monitor the highest GPU memory use during a time period? If you are using the DCGM API, you can specify how long and how many values should be stored; look at the dcgmWatchFields API and its maxKeepAge/maxKeepSamples arguments. The dcgmi tool does not provide any aggregation. You can then use dcgmGetValuesSince_v2 to retrieve multiple samples for the period you need.
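For illustration, a minimal C sketch of that approach, assuming a running nv-hostengine on localhost and using DCGM_FI_DEV_FB_USED (framebuffer memory used, in MiB) as the memory field; the field-group name and intervals are examples only and error handling is omitted:

/* Minimal sketch: watch a memory field, then reduce the stored samples to a peak value. */
#include <stdio.h>
#include <dcgm_agent.h>
#include <dcgm_structs.h>

static long long g_peakFbUsedMiB = 0;

/* Called once per GPU with every sample stored since the requested timestamp. */
static int peakCallback(dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId,
                        dcgmFieldValue_v1 *values, int numValues, void *userData)
{
    (void)entityGroupId; (void)entityId; (void)userData;
    for (int i = 0; i < numValues; i++)
        if (values[i].value.i64 > g_peakFbUsedMiB)
            g_peakFbUsedMiB = values[i].value.i64;
    return 0;
}

int main(void)
{
    dcgmHandle_t handle;
    dcgmFieldGrp_t fieldGroup;
    unsigned short fieldIds[] = { DCGM_FI_DEV_FB_USED };
    long long nextSince = 0;

    dcgmInit();
    dcgmConnect((char *)"127.0.0.1", &handle);
    dcgmFieldGroupCreate(handle, 1, fieldIds, (char *)"fb_used_watch", &fieldGroup);

    /* Sample every second (updateFreq is in microseconds); keep up to
     * one hour of history / 3600 samples in the hostengine's cache. */
    dcgmWatchFields(handle, DCGM_GROUP_ALL_GPUS, fieldGroup, 1000000, 3600.0, 3600);

    /* ... run the workload to be measured ... */

    /* Fetch every cached sample since timestamp 0 and keep the maximum. */
    dcgmGetValuesSince_v2(handle, DCGM_GROUP_ALL_GPUS, fieldGroup, 0, &nextSince,
                          peakCallback, NULL);
    printf("Peak framebuffer usage: %lld MiB\n", g_peakFbUsedMiB);

    dcgmDisconnect(handle);
    dcgmShutdown();
    return 0;
}

Note that updateFreq is in microseconds and maxKeepAge is in seconds.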

nikkon-dev commented 1 year ago

Clarification on the first question: we are using the NVML API to get values for the fields you mentioned in another issue (DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_MEM_COPY_UTIL), and here is the relevant NVML documentation:

/**
 * Utilization information for a device.
 * Each sample period may be between 1 second and 1/6 second, depending on the product being queried.
 */
typedef struct nvmlUtilization_st
{
    unsigned int gpu;                //!< Percent of time over the past sample period during which one or more kernels was executing on the GPU
    unsigned int memory;             //!< Percent of time over the past sample period during which global (device) memory was being read or written
} nvmlUtilization_t;
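
For reference, the NVML call that fills this struct is nvmlDeviceGetUtilizationRates; here is a minimal standalone sketch querying it for GPU 0 (error handling omitted):

/* Minimal standalone NVML sketch of the utilization struct shown above. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t device;
    nvmlUtilization_t util;

    nvmlInit_v2();
    nvmlDeviceGetHandleByIndex_v2(0, &device);
    nvmlDeviceGetUtilizationRates(device, &util);

    /* util.gpu: % of the sample period with at least one kernel executing;
     * util.memory: % of the sample period with device memory being read or written. */
    printf("gpu=%u%% memory=%u%%\n", util.gpu, util.memory);

    nvmlShutdown();
    return 0;
}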
cyLi-Tiger commented 1 year ago

3. Is there any way to monitor the highest GPU memory use during a time period? If you are using the DCGM API, you can specify how long and how many values should be stored; look at the dcgmWatchFields API and its maxKeepAge/maxKeepSamples arguments. The dcgmi tool does not provide any aggregation. You can then use dcgmGetValuesSince_v2 to retrieve multiple samples for the period you need.

@nikkon-dev Thanks for your reply!

I have a few questions about dcgmWatchFields. I noticed there is an updateFreq argument in dcgmWatchFields, and that dcgmWatchFields just tells the server to start recording updates for the given fields. So from my perspective, this API simply keeps recording updates at frequency updateFreq and stores them for maxKeepAge.

But according to the implementation of Triton Model Analyzer's DCGM monitor, the dcgm_monitor has a loop that supervises GPU behavior, and each iteration actually calls dcgmGetValuesSince to request the updated fields.

My question is: since the frequency at which data is fetched seems to be controlled by the loop mentioned above, what is the point of updateFreq in dcgmWatchFields?
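
To make the two frequencies in this question concrete, here is an illustrative C-API sketch (Model Analyzer itself uses the Python bindings; the intervals and names are examples only) separating the hostengine's sampling cadence from the client's fetch cadence:

/* Illustrative sketch of the two cadences discussed above. */
#include <stdio.h>
#include <unistd.h>
#include <dcgm_agent.h>
#include <dcgm_structs.h>

/* Each fetch can deliver several cached samples per field, one per updateFreq tick. */
static int countCallback(dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId,
                         dcgmFieldValue_v1 *values, int numValues, void *userData)
{
    (void)entityGroupId; (void)entityId; (void)values;
    *(int *)userData += numValues;
    return 0;
}

void monitorLoop(dcgmHandle_t handle, dcgmFieldGrp_t fieldGroup)
{
    long long since = 0, nextSince = 0;

    /* Server-side cadence: the hostengine samples the watched fields every 100 ms
     * (updateFreq is in microseconds) and caches up to 60 s of history. */
    dcgmWatchFields(handle, DCGM_GROUP_ALL_GPUS, fieldGroup, 100000, 60.0, 0);

    for (;;)
    {
        int fetched = 0;
        /* Client-side cadence: fetch once per second; each call returns whatever
         * samples the hostengine cached since the previous fetch (about 10 here). */
        dcgmGetValuesSince_v2(handle, DCGM_GROUP_ALL_GPUS, fieldGroup,
                              since, &nextSince, countCallback, &fetched);
        printf("fetched %d samples this iteration\n", fetched);
        since = nextSince;
        sleep(1);
    }
}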

starry91 commented 1 year ago

@nikkon-dev I was going through the different DCGM fields and had the following questions. Could you please help me with them?

Question-1: What is the difference between the fields below?

  1. DCGM_FI_DEV_GPU_UTIL vs. DCGM_FI_PROF_GR_ENGINE_ACTIVE
  2. DCGM_FI_DEV_MEM_COPY_UTIL vs. DCGM_FI_PROF_DRAM_ACTIVE

Question-2: I am looking to track the following metrics for my AI workloads:

  1. GPU utilization over time - the idea here is to understand whether the task is using the GPU efficiently or just wasting GPU resources.
  2. Memory utilization over time - the idea here is to know whether there is scope to increase the batch size of the AI workload.
  3. How effectively is my task using the GPU's parallelization capabilities? Some statistic to understand whether there is further scope to parallelize the computation, something like % core utilization and total core count. (Perhaps DCGM_FI_PROF_SM_ACTIVE?)
  4. Can I increase my computation batch size? This, I believe, should come from some memory utilization statistic.

Could you please advise which fields I should look into for the above stats? Note that I require the same set of metrics on both Tesla T4 and A100 (MIG) cards. (Asking because issue #58 seems to mention that the *_DEV_* fields above do not work for MIGs.)

Also, the DCGM documentation says that not all fields can be queried in parallel. Does this apply only to the *_PROF_* fields or also to the *_DEV_* fields? More specifically, I wanted to know whether DCGM_FI_DEV_GPU_UTIL can be added to any field group or whether it needs to be in its own group.

frittentheke commented 1 week ago

@nikkon-dev May I also ask you for some clarification on the metrics and fields that @starry91 asked about?

Apparently this question has moved to https://github.com/NVIDIA/DCGM/issues/64