Azure / azhpc-images

Azure HPC/AI VM Images
MIT License

CUDA driver version mismatched with CUDA runtime version #343

Open loligans opened 1 month ago

loligans commented 1 month ago

The GPU driver reports CUDA 12.2 as its supported version, but the installed CUDA toolkit (nvcc) is 12.4.

nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000001:00:00.0 Off |                    0 |
| N/A   28C    P0              76W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

The CUDA version mismatch causes GPU_Burn to hang. I believe the GPU driver should be updated to 550.54.15.

If the intended CUDA version is 12.2, then the GPU driver can remain at 535.161.08, but the CUDA toolkit should be downgraded to 12.2.

If the intended CUDA version is 12.4, then the GPU driver should be updated to 550.54.15.
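
For anyone who wants to check an image for the same situation, here is a minimal sketch that extracts the two versions from the command output shown above and compares them. The function names and the wording of the result strings are hypothetical, chosen for illustration only; the sample strings are copied from the output in this issue.

```python
import re


def driver_cuda_version(nvidia_smi_output: str) -> tuple[int, int]:
    """Return the (major, minor) CUDA version shown in the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def toolkit_cuda_version(nvcc_output: str) -> tuple[int, int]:
    """Return the (major, minor) release reported by `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def describe(driver: tuple[int, int], toolkit: tuple[int, int]) -> str:
    """Classify the relationship between the driver and toolkit CUDA versions."""
    if driver[0] != toolkit[0]:
        return "major version mismatch"
    if toolkit > driver:
        return "toolkit is newer than the driver's CUDA version"
    return "toolkit is not newer than the driver's CUDA version"


# Sample lines copied from the output in this issue.
SMI_HEADER = "| NVIDIA-SMI 535.161.08   Driver Version: 535.161.08   CUDA Version: 12.2 |"
NVCC_LINE = "Cuda compilation tools, release 12.4, V12.4.131"

driver = driver_cuda_version(SMI_HEADER)
toolkit = toolkit_cuda_version(NVCC_LINE)
print(driver, toolkit, describe(driver, toolkit))
```

On a real machine the two strings could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and the equivalent call for `nvcc --version`.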

Related issue: https://github.com/wilicc/gpu-burn/issues/7

LiquidPT commented 1 month ago

There were issues with Fabric Manager 550.54.15, so we had to revert both Fabric Manager and the GPU driver. According to NVIDIA's minor-version compatibility documentation, CUDA 12.4 should still be compatible with the 535.161.08 driver:

https://docs.nvidia.com/deploy/cuda-compatibility/index.html#minor-version-comaptibility

CUDA 12.4 has some critical fixes, so using the newer version is preferable.
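
The minor-version compatibility rule referenced above can be sketched roughly as follows: a toolkit from a given CUDA major release is expected to run on any driver at or above that major release's minimum driver version. The minimum versions in the table below (450.80.02 for CUDA 11.x and 525.60.13 for CUDA 12.x on Linux) are an assumption based on NVIDIA's compatibility documentation and should be verified against the linked page. The function names are hypothetical. The check also does not cover features that need a newer driver than the one installed.

```python
# Assumed minimum Linux driver versions for minor-version compatibility.
# Verify these against NVIDIA's CUDA compatibility documentation.
MIN_DRIVER_FOR_MAJOR = {
    11: (450, 80, 2),
    12: (525, 60, 13),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '535.161.08' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def minor_version_compatible(driver_version: str, toolkit_major: int) -> bool:
    """Return True if the driver meets the assumed minimum for the toolkit's major release."""
    minimum = MIN_DRIVER_FOR_MAJOR.get(toolkit_major)
    if minimum is None:
        raise ValueError(f"no minimum driver known for CUDA {toolkit_major}.x")
    return parse_version(driver_version) >= minimum


# Driver from this image checked against a CUDA 12.x toolkit.
print(minor_version_compatible("535.161.08", 12))
```

Under these assumptions, the 535.161.08 driver passes the check for a CUDA 12.4 toolkit, which matches the statement that the combination should be compatible.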