NVIDIA / CUDALibrarySamples


cuSPARSELt matmul example not working on M=N=K=8192 #203

Open OrenLeung opened 3 months ago

OrenLeung commented 3 months ago

On https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul

the example runs fine with the existing small m, n, k values, but when I change m, n, and k to 8192, I get a runtime error. Any pointers or patches on how to fix it?

CUSPARSE API failed at line 191 with error: operation not supported (10)
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSELt/matmul/matmul_example.cpp#L116-L118

fbusato commented 3 months ago

@OrenLeung a couple of questions to better understand your issue.

OrenLeung commented 3 months ago

hi @fbusato , thanks for the quick reply.

I didn't change anything else in the code, just the m, n, k variables. I was able to compile and run the matmul example with the default m, n, k values.

fbusato commented 3 months ago

It seems that you are using cuSPARSELt 0.5.2.1 (libcusparseLt.so.0.5.2.1), which doesn't support Hopper: https://docs.nvidia.com/cuda/cusparselt/release_notes.html

My suggestion is to manually download and install the latest version here: https://developer.nvidia.com/cusparselt-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

OrenLeung commented 3 months ago

Hi @fbusato

Thanks for the suggestion. I have now correctly symlinked to cuSPARSELt v0.6.2 as you suggested. I have verified that the m, n, k values provided in the example work properly and do not deadlock.

Unfortunately, for m=n=k=8192 the program appears to deadlock. It seems to be stuck in a half-to-float conversion, __internal_half2float. Strange.

I have also double-checked that m, n, k are the only things I changed.


fbusato commented 3 months ago

Hi @OrenLeung, the 'deadlock' you observe is due to the long computation time of the host-side correctness check for large matrices. If you want to speed up the process, my suggestion is to use cuBLAS to compute the reference matrix multiplication on the GPU.
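
For a sense of scale, a host-side reference check does one multiply and one add per inner-product term, so its cost grows with m*n*k. The sketch below is not taken from the sample: gemm_flops, host_check_seconds, and the sustained host rate passed in are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in a dense (m x k) * (k x n) product:
// one multiply and one add per inner-product term.
inline std::uint64_t gemm_flops(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return 2 * m * n * k;
}

// Rough wall-clock estimate, in seconds, for a host-side reference loop.
// host_gflops is an ASSUMED sustained CPU rate, not a measured one.
inline double host_check_seconds(std::uint64_t m, std::uint64_t n, std::uint64_t k,
                                 double host_gflops) {
    assert(host_gflops > 0.0);
    return static_cast<double>(gemm_flops(m, n, k)) / (host_gflops * 1e9);
}
```

Under these assumptions, m=n=k=8192 needs 2^40 (about 1.1e12) operations. At an assumed 1 GFLOP/s for a naive single-threaded loop, host_check_seconds(8192, 8192, 8192, 1.0) comes to roughly 1100 seconds, or about 18 minutes. That is 16,777,216 times the work of a 32x32x32 problem (32 is an illustrative small size, not necessarily the sample's default), which is why the program can look hung rather than merely slow.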

OrenLeung commented 3 months ago

Hi @fbusato

Thanks for your suggestion! I have now got it working, but unfortunately the realized TFLOP/s is nowhere close to the peak theoretical sparse TFLOP/s. Do you have any tips on how to improve the cuSPARSELt performance?

Realized sparse cuSPARSELt fp16: 1005 TFLOP/s out of the peak theoretical 1,979
Realized dense cuBLAS fp16: 870 TFLOP/s out of the peak theoretical 979.5

This means there is only around a 15% realized improvement. Although no one was expecting the claimed 2x improvement, one would expect closer to a 40-50% realized improvement. On A100, NVIDIA claims that the speedup for big GEMMs is 1.6-1.8x: https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/

Attached is my script for benchmarking 8192x8192x8192 cuSPARSELt 2:4 semi-structured fp16 sparse GEMMs vs cuBLAS fp16 dense GEMMs on H100. I have ensured that I am measuring GPU time through CUDA events and that I am on the latest cuSPARSELt version. https://github.com/OrenLeung/CUDALibrarySamples/commit/e3cfb07e6b6625ec33b8526d82bebd5a21185624
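
The throughput figures above follow from the usual GEMM operation count, TFLOP/s = 2*m*n*k / seconds / 1e12. The following is a minimal sketch, not the code from the benchmark script. It assumes the sparse run is credited with the same dense-equivalent operation count as the dense run, and gemm_tflops and speedup are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Dense-equivalent throughput in TFLOP/s for an m x n x k GEMM that took
// elapsed_ms milliseconds. ASSUMPTION: sparse kernels are credited with the
// full 2*m*n*k operations, so sparse and dense numbers are directly comparable.
inline double gemm_tflops(std::uint64_t m, std::uint64_t n, std::uint64_t k,
                          double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double flops = 2.0 * static_cast<double>(m) * static_cast<double>(n) *
                         static_cast<double>(k);
    return flops / (elapsed_ms * 1e-3) / 1e12;
}

// Ratio of sparse to dense throughput; 1.15 means a 15% improvement.
inline double speedup(double sparse_tflops, double dense_tflops) {
    assert(dense_tflops > 0.0);
    return sparse_tflops / dense_tflops;
}
```

With the numbers reported above, speedup(1005, 870) is about 1.155, which matches the roughly 15% improvement quoted. For reference, an 8192x8192x8192 GEMM that finished in exactly 1 ms would correspond to about 1099.5 TFLOP/s.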

fbusato commented 3 months ago

There are several things to consider when benchmarking cuSPARSELt. First, you should use Nsight Systems (or CUPTI) to get more reliable time measurements. Second, you need to run the autotuning functionality; see the other example. Other points to consider: do some warm-up runs, lock the GPU SM/memory clocks, disable autoboost, ensure there is no power/thermal throttling, disable CPU Turbo Boost, set the CPU governor to performance, etc.

OrenLeung commented 3 months ago

Hi fbusato,

Thanks for your suggestion.

  1. I believe I am already running the autotuning function cusparseLtMatmulSearch. Is there another function that I am missing? https://github.com/OrenLeung/CUDALibrarySamples/blob/e3cfb07e6b6625ec33b8526d82bebd5a21185624/cuSPARSELt/matmul/matmul_example.cpp#L348
  2. I have already locked the GPU clock speed: sudo nvidia-smi -i 0 --lock-gpu-clocks=1830,1830
  3. As you may be aware, due to throttling, the TFLOP/s gets even worse afterwards (I have included a time.sleep between benchmarking sparse and dense to allow the GPU to cool down). Even with warm-up, the perf delta is still around 15%.


OrenLeung commented 3 months ago

It seems that when the inputs are changed to a normal distribution centered around 0, the sparse performance gets a bit better, with a 20% improvement over dense. https://github.com/OrenLeung/CUDALibrarySamples/commit/9cabba4b1154f2c49037d89171d41c31b6033c79

# median of 5000 iterations after removing the first 100 iterations
Dense Median: 642.971 TFLOP/s
Sparse Median: 768.348 TFLOP/s
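
The statistic described above (a median over many timed iterations, with the first ones discarded as warm-up) can be sketched as follows. This is an illustration, not the code from the linked commit: median_after_warmup is a hypothetical name, and the per-iteration samples are assumed to have been collected already, for example from CUDA event timings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the samples that remain after dropping the first `warmup` entries.
// Takes the vector by value because nth_element reorders it.
inline double median_after_warmup(std::vector<double> samples, std::size_t warmup) {
    assert(samples.size() > warmup);
    samples.erase(samples.begin(),
                  samples.begin() + static_cast<std::ptrdiff_t>(warmup));

    const std::ptrdiff_t mid = static_cast<std::ptrdiff_t>(samples.size() / 2);
    // Partially sort so that samples[mid] is the element a full sort would put there.
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const double upper = samples[static_cast<std::size_t>(mid)];
    if (samples.size() % 2 == 1) {
        return upper;
    }
    // Even count: average the two middle values. After nth_element, the lower
    // middle value is the largest element in the first half.
    const double lower = *std::max_element(samples.begin(), samples.begin() + mid);
    return 0.5 * (lower + upper);
}
```

A median is less sensitive than a mean to the occasional slow iteration caused by throttling or clock changes, which is presumably why it was chosen here. With the medians reported above, 768.348 / 642.971 is about 1.195, consistent with the stated 20% improvement.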


fbusato commented 3 months ago

@OrenLeung we evaluated the same sparse GEMM operation on our systems with default clocks. We observed a 1.38x speedup (sparse vs. dense) on an H100 350W and 1.22x on an H100 800W.

OrenLeung commented 3 months ago

@fbusato thanks for running it. By "800W H100", you mean 700W, right? We also see around a 1.20-1.22x improvement.

Would you have any suggestions on shapes where sparsity would show the biggest gain compared to dense?

fbusato commented 3 months ago

I don't have any specific suggestions other than to try different shapes and data types. The results are affected by the GPU model, clock settings, and CUDA version, so it is hard to give exact sizes. The main engineer is out of the office and will be back in two weeks. He can help you better.