NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0
355 stars, 49 forks

When I run diagnostics, the two GPUs in the group both get failed results. #145

Open BetaZYN opened 6 months ago

BetaZYN commented 6 months ago
  1. When I run diagnostics on GPU0 alone, they fail.
     $ dcgmi diag -r 2 -g 7
     [image]
  2. When I run diagnostics on GPU1 alone, the result is normal (the long level is also normal).
     $ dcgmi diag -r 2 -g 8
     [image]
  3. When I run diagnostics on GPU0 and GPU1 together in one group, both GPUs fail. It looks as if, once GPU0 fails, the diagnostic is not run on GPU1. Why does that happen?
     $ dcgmi diag -r 2 -g 0
     [image]
     $ dcgmi diag -r 3 -g 0
     [image]
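
For anyone trying to reproduce this, the three runs above can be sketched as a small script. This is only a sketch: the group IDs 7, 8 and 0 are the ones from this report and will differ on other hosts, and `DRY_RUN`, `diag_cmd` and `run_diag` are hypothetical names made up for this example, not part of DCGM. By default the script only prints the commands; set `DRY_RUN=0` to actually run them.

```shell
#!/usr/bin/env bash
# Sketch: repeat the three runs from the report, one group at a time.
# Group IDs 7, 8 and 0 come from the report and are host-specific.

# Print the dcgmi command line for a given run level and group ID.
diag_cmd() {
  local level="$1" group="$2"
  printf 'dcgmi diag -r %s -g %s\n' "$level" "$group"
}

# Print the command when DRY_RUN is unset or 1; otherwise run it
# and report its exit status.
run_diag() {
  local cmd
  cmd="$(diag_cmd "$1" "$2")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"
  else
    echo "== $cmd"
    $cmd
    echo "exit status: $?"
  fi
}

# GPU0 alone (group 7), GPU1 alone (group 8), both GPUs (group 0).
for group in 7 8 0; do
  run_diag 2 "$group"
done
```

Running each group separately and keeping the full output side by side makes it easier to compare the per-GPU results of the single-GPU runs with those of the two-GPU group.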

nvvfedorov commented 6 months ago

@BetaZYN, the best place to report the issue you are seeing is here: https://github.com/nVIDIA/dcgm.

nikkon-dev commented 6 months ago

@BetaZYN,

Can you please explain why you think the GPU1 diagnostic isn't running? In the attached screenshot, the GPU1 results appear in the output.