INT8 appears to be hung on NVIDIA H100 NVL

I keep running into this INT8 test appearing to hang, which terminates stress testing early. We've seen this across multiple identically configured nodes, all with: 4x H100 NVL, RHEL 8.8, CUDA 12.6.

Here's the section of the log where it attempts to start the INT8 test but then appears to hang. It always seems to be this specific INT8 test, following the FP32 cublasLt test on the final device (device 3). Any assistance would be appreciated!

***** STARTING TEST 4: FP32 On Device 0 NVIDIA H100 NVL testing cublasLt Allocate matrixSize Total Bytes A + B + C: 40051774208

args: ta=N tb=T m=105712 n=34472 k=45432 alpha = (0x3f800000, 1) beta= (0x00000000, 0)

args: lda=105712 ldb=34472 ldc=105712 ldd=105712 loop=1

^^^^ CUDA : elapsed = 7.49945 sec, Gflops = 44152.295 testing cublasLt pass ***** TEST FP32 On Device 0 NVIDIA H100 NVL Device 1: "NVIDIA H100 NVL", PCIe: 59 stress_tests[0].test_name INT8 P bisb_imma m 355218 n 5263 k 52437 ta 1 tb 0 B 0

STARTING TEST 0: INT8 On Device 1 NVIDIA H100 NVL TEST INT8 On Device 1 NVIDIA H100 NVL stress_tests[1].test_name FP16 P hsh m 5928 n 6944 k 2144 ta 0 tb 1 B 0

***** STARTING TEST 1: FP16 On Device 1 NVIDIA H100 NVL testing cublasLt Allocate matrixSize Total Bytes A + B + C: 137523200

args: ta=N tb=T m=5928 n=6944 k=2144 alpha = (0x3f800000, 1) beta= (0x00000000, 0)

args: lda=5928 ldb=6944 ldc=5928 ldd=5928 loop=1

^^^^ CUDA : elapsed = 0.000304544 sec, Gflops = 579592.335 testing cublasLt pass ***** TEST FP16 On Device 1 NVIDIA H100 NVL stress_tests[2].test_name TF32 P sss_fast_tf32 m 105712 n 34472 k 45432 ta 1 tb 0 B 0

***** STARTING TEST 2: TF32 On Device 1 NVIDIA H100 NVL testing cublasLt Allocate matrixSize Total Bytes A + B + C: 40051774208

args: ta=T tb=N m=105712 n=34472 k=45432 alpha = (0x3f800000, 1) beta= (0x00000000, 0)

args: lda=45432 ldb=45432 ldc=105712 ldd=105712 loop=1

TEST INT8 appears to be hung Terminating stress testing... WATCHDOG thread exiting.... ^^^^ CUDA : elapsed = 7.99028 sec, Gflops = 41440.095 testing cublasLt pass TEST TF32 On Device 1 NVIDIA H100 NVL TEST FAILED ****
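As a sanity check on the log above, the reported allocation sizes and throughput figures seem consistent with a plain GEMM accounting: A is m x k, B is k x n, C is m x n, and the operation count is 2*m*n*k. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and the element sizes (4 bytes for FP32, 2 bytes for FP16) are assumptions that happen to reproduce the logged totals.

```python
def gemm_bytes(m, n, k, elem_size):
    """Total bytes for A (m x k), B (k x n) and C (m x n), assuming one element size."""
    return (m * k + k * n + m * n) * elem_size


def gemm_gflops(m, n, k, elapsed_sec):
    """Throughput in Gflops, counting 2*m*n*k operations for one GEMM."""
    return 2 * m * n * k / elapsed_sec / 1e9


# FP32 test from the log: m=105712 n=34472 k=45432, assumed 4 bytes per element
print(gemm_bytes(105712, 34472, 45432, elem_size=4))  # 40051774208

# FP16 test from the log: m=5928 n=6944 k=2144, assumed 2 bytes per element
print(gemm_bytes(5928, 6944, 2144, elem_size=2))  # 137523200

# Should land close to the 44152.295 Gflops the log reports for FP32
# (the logged elapsed time is rounded, so the match is not exact)
print(gemm_gflops(105712, 34472, 45432, elapsed_sec=7.49945))
```

The same helper could be applied to the INT8 dimensions (m 355218 n 5263 k 52437) to estimate its footprint, but the per-element sizes for that test are not shown in the log, so any figure would be a guess.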

We have enterprise support, and the NVIDIA team verified that this is reproducible on their end, but this codebase is not part of the support package. The team advised us to switch to stress testing through DCGM:

DCGM: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html

The part most relevant to you is probably the diagnostics rather than the monitoring. The diagnostics documentation is here: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html

A couple of important commands, described in the feature overview: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html

dcgmi discovery -l

This lists the available GPUs.

dcgmi diag -r

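This runs diagnostic workloads on the available GPUs and prints a health summary at the end of the process. Note that -r expects a run level after it, from 1 (quick) to 4 (most extensive), for example: dcgmi diag -r 1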
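If you want to script the two commands above across several nodes, a minimal sketch could look like the following. It assumes dcgmi is installed and on PATH; the function names dcgm_commands and run_dcgm_checks are hypothetical helpers, not part of DCGM, and restricting the run level to 1 through 4 is a simplification.

```python
import shutil
import subprocess


def dcgm_commands(run_level=1):
    """Build the two dcgmi invocations as argv lists (hypothetical helper)."""
    if run_level not in (1, 2, 3, 4):
        raise ValueError("run_level is expected to be 1, 2, 3 or 4")
    return [
        ["dcgmi", "discovery", "-l"],             # list the available GPUs
        ["dcgmi", "diag", "-r", str(run_level)],  # run diagnostics at this level
    ]


def run_dcgm_checks(run_level=1):
    """Run the commands in order; return False if dcgmi is missing or a command fails."""
    if shutil.which("dcgmi") is None:
        print("dcgmi not found on PATH; install DCGM first")
        return False
    for argv in dcgm_commands(run_level):
        print("$", " ".join(argv))
        result = subprocess.run(argv)
        if result.returncode != 0:
            print("command exited with code", result.returncode)
            return False
    return True
```

Calling run_dcgm_checks(3) would run the discovery step followed by a longer diagnostic pass; the output is printed by dcgmi itself, so the health summary appears as usual.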
