NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

sm_stress test is missing from dcgm-3.2.3 #98

Open bsteinb opened 1 year ago

bsteinb commented 1 year ago

The sm_stress test seems to be missing from the latest release. When trying to run it explicitly, DCGM complains:

$ dcgmi diag -i 0,1,2,3 -v -r sm_stress --fail-early -p "sm_stress.target_stress=17000"
Invalid Parameter String: test 'sm_stress' does not match any loaded tests. Check logs for plugin failures.

The corresponding shared objects are no longer part of the RPM:

/usr/share/nvidia-validation-suite/plugins/cuda12/libSmStress.so
/usr/share/nvidia-validation-suite/plugins/cuda12/libSmStress.so.3
/usr/share/nvidia-validation-suite/plugins/cuda12/libSmStress.so.3.1.8
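
To confirm on a given machine whether the plugin files are present, a small check like the following may help. It is only a sketch: `has_plugin` is a hypothetical helper name, and the default directory is simply the path quoted above, which may differ on other installs or CUDA versions.

```shell
# Hypothetical helper: succeed if any shared object for the named plugin exists.
# $1 = plugin base name (e.g. SmStress)
# $2 = plugin directory (assumed default: the path quoted in this report)
has_plugin() {
    plugin_dir="${2:-/usr/share/nvidia-validation-suite/plugins/cuda12}"
    # Matches the unversioned and versioned names, e.g. libSmStress.so and libSmStress.so.3
    for f in "$plugin_dir"/lib"$1".so*; do
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}

if has_plugin SmStress; then
    echo "SmStress plugin found"
else
    echo "SmStress plugin not found"
fi
```

If the plugin is reported as not found, the "does not match any loaded tests" error above is expected, because there is no plugin for DCGM to load.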

The release notes for version 3.1.3 mention that sm_stress is no longer run as part of diagnostic levels 3 or 4, but do not mention the test being removed in 3.2.3.

The sources for version 3.2.3 have not been exported to GitHub yet.

(As a side note, the documentation for DCGM Diagnostics contradicts the release notes, since it still lists sm_stress as part of diagnostic levels 3 and 4.)

glowkey commented 1 year ago

The "sm_stress" test was deprecated in 3.1.3 because its functionality is superseded by the "diagnostic" test. It was removed in 3.2.3. The "diagnostic" test is the recommended replacement.
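
As a rough sketch of what switching to the replacement could look like, the command from the original report can be rerun with the test name changed to "diagnostic". The wrapper name `run_diag` and the `DRY_RUN` variable are hypothetical conveniences, not part of DCGM. The GPU list and flags are copied from the report, and the `sm_stress.target_stress` parameter is dropped because it refers to the removed test.

```shell
# Hypothetical wrapper around the command from the report.
# $1 = test name to run (defaults to "diagnostic", the suggested replacement)
# Set DRY_RUN to any non-empty value to print the command instead of running it.
run_diag() {
    test_name="${1:-diagnostic}"
    # GPU IDs and flags are taken from the original report; adjust for your system.
    set -- dcgmi diag -i 0,1,2,3 -v -r "$test_name" --fail-early
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

Calling `DRY_RUN=1 run_diag` prints the command for inspection, and calling `run_diag` executes it. Any tuning parameters for the "diagnostic" test would need to be taken from the DCGM documentation rather than carried over from sm_stress.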