NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

[Question]: how to detect GPUs with low compute performance #122

Closed dmonakhov closed 6 months ago

dmonakhov commented 10 months ago

Hi, we have a broken GPU-0 (A100-40G/DGX) whose compute performance is roughly 3x lower than expected: 6 TFLOPS instead of 19 TFLOPS for FP32. I was very surprised to see that this was not caught by dcgm diagnostics --run xlong, which completes without any issue. Likely I'm doing something wrong. Which test is responsible for validating compute performance? Or did we miss some config that sets the expected performance?

BTW, the broken GPU is easily detected by the ubergemm binary:

time /usr/share/nvidia-validation-suite/plugins/cuda12/ubergemm --adapters 0,1,2,3,4,5,6,7 --time_to_run 10 --internal_loops=100 | grep GFlops
GEMM [I] (ADAPTER#0)  Ending with 6698.442948608 GFlops
GEMM [I] (ADAPTER#2)  Ending with 19239.381499904 GFlops
GEMM [I] (ADAPTER#1)  Ending with 19148.107153408 GFlops
GEMM [I] (ADAPTER#3)  Ending with 19256.40617984 GFlops
GEMM [I] (ADAPTER#4)  Ending with 19190.677241856 GFlops
GEMM [I] (ADAPTER#5)  Ending with 19096.020189184 GFlops
GEMM [I] (ADAPTER#6)  Ending with 19272.568930304 GFlops
GEMM [I] (ADAPTER#7)  Ending with 19186.795413504 GFlops 
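
A minimal sketch of how the output above could be checked automatically. It assumes only the line format shown in this comment; the function names and the 0.8 cutoff relative to the median are hypothetical choices for illustration, not anything ubergemm or DCGM provides.

```python
import re
import statistics

# Matches lines such as "GEMM [I] (ADAPTER#0)  Ending with 6698.442948608 GFlops".
LINE_RE = re.compile(r"\(ADAPTER#(\d+)\)\s+Ending with\s+([\d.]+)\s+GFlops")


def parse_gflops(output: str) -> dict[int, float]:
    """Return {adapter index: GFlops} for every matching line."""
    return {int(m.group(1)): float(m.group(2)) for m in LINE_RE.finditer(output)}


def slow_adapters(gflops: dict[int, float], min_fraction: float = 0.8) -> list[int]:
    """Adapters below min_fraction of the median (hypothetical cutoff)."""
    median = statistics.median(gflops.values())
    return sorted(a for a, v in gflops.items() if v < min_fraction * median)


sample = """\
GEMM [I] (ADAPTER#0)  Ending with 6698.442948608 GFlops
GEMM [I] (ADAPTER#2)  Ending with 19239.381499904 GFlops
GEMM [I] (ADAPTER#1)  Ending with 19148.107153408 GFlops
GEMM [I] (ADAPTER#3)  Ending with 19256.40617984 GFlops
GEMM [I] (ADAPTER#4)  Ending with 19190.677241856 GFlops
GEMM [I] (ADAPTER#5)  Ending with 19096.020189184 GFlops
GEMM [I] (ADAPTER#6)  Ending with 19272.568930304 GFlops
GEMM [I] (ADAPTER#7)  Ending with 19186.795413504 GFlops
"""

results = parse_gflops(sample)
print(slow_adapters(results))  # prints [0]
```

The median is used here rather than the mean so that a single slow adapter does not pull the reference value down with it.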
nikkon-dev commented 10 months ago

@dmonakhov,

Could you run /usr/share/nvidia-validation-suite/nvvs --specifiedtest xlong -d debug and provide the nvvs.log?

dmonakhov commented 10 months ago

@nikkon-dev Please see logs attached: nvvs_--specifiedtestxlong-d_debug.txt nvidia-dcgm-logs.tar.gz

From the source code at https://github.com/NVIDIA/DCGM/blob/master/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp#L723-L734 I can see that we do dump an estimate of compute performance, but it is never validated against any golden value. Am I missing something? If compute validation is missing, I would like to send a patch. There are two options for fixing this:

dmonakhov commented 9 months ago

@nikkon-dev Ping, are you able to take a look?

dbeer commented 9 months ago

@dmonakhov - we would be happy to accept either of those patches. I can consult with people internally to see if we are okay having the deviation option enabled by default, but yes, this is a fine test condition to add for GPUs.

shnv2023 commented 7 months ago

@dmonakhov, a change has been introduced which adds the gflops_tolerance_pcnt parameter to the diagnostic plugin. After the gpuburn test is run on each GPU on the host, the average is determined. If a given GPU is more than gflops_tolerance_pcnt below that average, an error will be reported. Barring unforeseen issues, this change seems likely to be included in the next major release of DCGM.

Example Output (could change between now and release):

2024-01-09 18:07:17.184 WARN  [10329:10329] plugin diagnostic: Detected 129.13 GFLOPs for GPU 2 which is below the threshold 132.66
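
A rough sketch of the check described above, for readers who want to reason about which tolerance to pick. The exact formula DCGM uses is not shown in this thread; the threshold here is assumed to be mean * (1 - tolerance), and the function name is hypothetical. The sample values are the ubergemm results posted earlier in this issue, rounded.

```python
from statistics import mean


def below_tolerance(gflops_by_gpu: dict[int, float], tolerance_pcnt: float):
    """Return (threshold, GPUs below it).

    Assumption: threshold = mean * (1 - tolerance_pcnt). This mirrors the
    description in the comment above, not the actual DCGM source.
    """
    avg = mean(gflops_by_gpu.values())
    threshold = avg * (1.0 - tolerance_pcnt)
    flagged = sorted(g for g, v in gflops_by_gpu.items() if v < threshold)
    return threshold, flagged


measured = {
    0: 6698.44, 1: 19148.11, 2: 19239.38, 3: 19256.41,
    4: 19190.68, 5: 19096.02, 6: 19272.57, 7: 19186.80,
}

threshold, flagged = below_tolerance(measured, 0.20)
print(f"threshold={threshold:.2f} GFLOPs, flagged GPUs: {flagged}")
```

Because the slow GPU is itself part of the average, it lowers the threshold it is compared against, so under this assumed formula a very loose tolerance on a host with few GPUs could let a degraded device pass.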
dmonakhov commented 7 months ago

@shnv2023 Can you please point me to a commit ID? It seems it has not been published yet. At least, I cannot find it in the latest HEAD.

shnv2023 commented 7 months ago

@dmonakhov, the change will not be made public until it is released.

shnv2023 commented 6 months ago

@dmonakhov, release 3.3.5 introduces a new diagnostic parameter, diagnostic.gflops_tolerance_pcnt, which will report an error if the gflops of one or more GPUs are not within a specified percentage of the mean.

For example, to run the diagnostic, reporting an error if a GPU reports gflops not within 60% of the mean gflops across all GPUs:

dcgmi diag -r 3 -p diagnostic.gflops_tolerance_pcnt=0.60

dmonakhov commented 6 months ago

ACK. In my case, the failure on a bad GPU looks like this:

{
    "gpu_id" : "5",
    "info" : "GPU 5 Allocated space for 137 output matricies from 75949395148 bytes available., GPU 5 Running with precisions: FP64 1, FP32 1, FP16 1, GPU 5 GPU 5 calculated at approximately 511.70 gigaflops during this test",
    "status" : "Fail",
    "warnings" :
    [
        {
            "error_category" : 2,
            "error_id" : 110,
            "error_severity" : 4,
            "warning" : "GPU 5 Detected 511.70 GFLOPs for GPU 5 which is below the threshold 1480.77 Please verify your user-specified variance tolerance is set appropriately; if so, and if errors are persistent, please run a field diagnostic."
        },
        {
            "error_category" : 15,
            "error_id" : 43,
            "error_severity" : 1,
            "warning" : "GPU 5 Clocks are being throttled for GPU 5 because of clock throttling starting 4.7 seconds into the test. clocks_throttle_reason_sw_thermal_slowdown: the GPU or its memory have reached unsafe temperatures. Check DCGM and system logs for errors. Reset GPU. Restart DCGM. Rerun diagnostics."
        }
    ]
},
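
For anyone who wants to pick such failures out of a JSON report in a script, here is a minimal sketch. It relies only on the keys visible in the fragment above (gpu_id, status, warnings, warning). The enclosing structure of the full report is not shown in this thread, so the function searches recursively instead of assuming a path; the "per_gpu_results" wrapper key, the function name, and the shortened warning strings in the sample are placeholders for illustration.

```python
import json


def failed_gpus(node):
    """Recursively collect (gpu_id, [warning text, ...]) for entries with status "Fail"."""
    found = []
    if isinstance(node, dict):
        if node.get("status") == "Fail" and "gpu_id" in node:
            warnings = [w.get("warning", "") for w in node.get("warnings", [])]
            found.append((node["gpu_id"], warnings))
        for value in node.values():
            found.extend(failed_gpus(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(failed_gpus(item))
    return found


# Hypothetical wrapper object; only the inner entry layout comes from the fragment above.
report = json.loads("""
{
  "per_gpu_results": [
    {
      "gpu_id": "5",
      "status": "Fail",
      "warnings": [
        {"error_id": 110, "warning": "GPU 5 Detected 511.70 GFLOPs for GPU 5 which is below the threshold 1480.77"},
        {"error_id": 43, "warning": "GPU 5 Clocks are being throttled for GPU 5"}
      ]
    },
    {"gpu_id": "6", "status": "Pass"}
  ]
}
""")

for gpu_id, warnings in failed_gpus(report):
    print(f"GPU {gpu_id}: {len(warnings)} warning(s)")
    for text in warnings:
        print("  -", text)
```

In this case the second warning (thermal slowdown) is a useful hint about why the GFLOPs figure is low, so it is worth surfacing all warnings for a failed GPU rather than only the first.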