NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

What are the detailed meanings of some test items in DCGM Diag.cpp? #96

Open irvingans opened 1 year ago

irvingans commented 1 year ago

Hi, I don't fully understand some of the test items in https://github.com/NVIDIA/DCGM/blob/master/dcgmi/Diag.cpp. Are there detailed explanations of the test items below?

  1. Permissions and OS Blocks
  2. Persistence Mode
  3. Page Retirement/Row Remap
  4. Targeted Stress

And why do we need tests regarding these items?

  1. Environment Variables
nikkon-dev commented 1 year ago

@irvingans,

dcgmi/Diag.cpp is just a launcher for the nvvs binary. Please take a look at this code.

The nvvs software plugin checks for known preconditions that could affect or slow down the later tests. If the software tests fail, the system is usually misconfigured or needs a reboot, and it should not be used in production.
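
To illustrate the idea of gating on preconditions, here is a minimal Python sketch. It is not the nvvs implementation: `CheckResult` and `gate_on_software_checks` are hypothetical names, and the results are made-up sample data that reuse the test item names from this thread. It only shows the pattern of refusing to continue when any software check fails.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one precondition check (hypothetical structure, not an nvvs type)."""
    name: str
    passed: bool
    detail: str = ""


def gate_on_software_checks(results):
    """Return (ok, failures); ok is True only if every precondition check passed."""
    failures = [r for r in results if not r.passed]
    return (len(failures) == 0, failures)


# Sample data only: the names come from the question above, the outcomes are invented.
results = [
    CheckResult("Permissions and OS Blocks", True),
    CheckResult("Persistence Mode", False, "persistence mode is disabled"),
    CheckResult("Environment Variables", True),
    CheckResult("Page Retirement/Row Remap", True),
]

ok, failures = gate_on_software_checks(results)
if not ok:
    # A failed precondition suggests a misconfigured system, so skip the heavier tests.
    for failure in failures:
        print(f"precondition failed: {failure.name} ({failure.detail})")
```

With this sample data the gate fails on the Persistence Mode entry, so the later stress tests would be skipped until the system is fixed.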

nikkon-dev commented 1 year ago

Some description is also available here

irvingans commented 1 year ago

Hi @nikkon-dev, thanks for the explanations. I now have some further questions about line 565-565 in https://github.com/NVIDIA/DCGM/blob/master/sdk/nvidia/nvml/nvml.h: [image]

  1. How do these factors (sync boost, board limit, low utilization, board reliability limit) cause the GPU to run below application clocks?
  2. What is the detailed meaning of board reliability limit?
irvingans commented 1 year ago

Hi @aaronp24, @lukeyeager, @aflat, any ideas about my questions above?