Open OrenLeung opened 4 months ago
@OrenLeung a couple of questions to better understand your issue.
hi @fbusato , thanks for the quick reply.
I didn't change anything else in the code, just the m,n,k vars. I was able to compile & run the matmul example with default m,n,k vars.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install libcusparselt0 libcusparselt-dev
I have double-checked that my cusparse .so is at my cuda_home
ls /usr/local/cuda/lib64/libcusparse
libcusparse.so libcusparseLt.so libcusparseLt_static.a
libcusparse.so.12 libcusparseLt.so.0
libcusparse.so.12.5.1.3 libcusparseLt.so.0.5.2.1
It seems that you are using cuSPARSELt 0.5.2.1 (libcusparseLt.so.0.5.2.1), which doesn't support Hopper: https://docs.nvidia.com/cuda/cusparselt/release_notes.html
My suggestion is to manually download and install the latest version here https://developer.nvidia.com/cusparselt-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local
Hi @fbusato
Thanks for the suggestion. I have now correctly symlinked to cuSPARSELt v0.6.2 as you suggested. I have verified that the m,n,k provided in the example work properly and do not deadlock.
But unfortunately for m=n=k=8192, I am deadlocked; it seems to be deadlocked on a half-to-float conversion, __internal_half2float. Strange.
I have also double-checked that m,n,k is the only thing I changed.
Hi @OrenLeung, the 'deadlock' you observe is due to the long computation time on the host side (correctness) for large matrices. If you want to speed up the process, my suggestion is to use cuBLAS to compute the matrix multiplication on the GPU.
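A rough way to see why the host-side check can look like a hang: assuming the sample verifies the result with a naive triple-loop GEMM on the CPU (an assumption about the sample, not confirmed in this thread), the work grows as m·n·k multiply-adds, each with half-to-float conversions. The helper name below is hypothetical.

```cpp
#include <cstdint>

// Multiply-add count of a naive host-side reference GEMM
// (three nested loops over m, n and k). Hypothetical helper for illustration.
constexpr std::uint64_t host_reference_macs(std::uint64_t m,
                                            std::uint64_t n,
                                            std::uint64_t k) {
    return m * n * k;
}
```

For m=n=k=8192 this is 8192^3 = 549,755,813,888 multiply-adds, which a single CPU thread will take a long time to finish.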
hi @fbusato
Thanks for your suggestion! I have now got it working, but unfortunately the realized TFLOP/s is nowhere close to the peak theoretical sparse TFLOP/s. Do you have any tips on how to improve the cuSPARSELt performance?
realized sparse cuSPARSELt fp16: 1005 TFLOP/s out of the peak theoretical 1,979
realized dense cuBLAS fp16: 870 TFLOP/s out of the peak theoretical 979.5
This means there is only around a 15% realized improvement. Although no one was expecting the claimed 2x improvement, one would expect closer to a 40-50% realized improvement. On A100, Nvidia claims that the speedup for big GEMMs is 1.6-1.8x https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/
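The arithmetic behind these figures can be sketched as follows. This assumes the usual convention of counting 2·m·n·k FLOPs for both the dense and the sparse GEMM (dense-equivalent FLOPs); the function names are hypothetical and not taken from the linked script.

```cpp
// FLOP count of an m x n x k GEMM: one multiply and one add per (i, j, l) triple.
constexpr double gemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Throughput in TFLOP/s for a GEMM that took elapsed_ms milliseconds
// (for example, a value obtained from cudaEventElapsedTime).
constexpr double tflops(double m, double n, double k, double elapsed_ms) {
    return gemm_flops(m, n, k) / (elapsed_ms * 1e-3) / 1e12;
}

// Ratio of sparse to dense throughput.
constexpr double speedup(double sparse_tflops, double dense_tflops) {
    return sparse_tflops / dense_tflops;
}
```

With the numbers above, speedup(1005, 870) is about 1.155. The sparse run reaches about 51% of its theoretical peak (1005 / 1979) and the dense run about 89% (870 / 979.5).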
Attached is my script for benchmarking 8192x8192x8192 cuSPARSELt 2:4 semi-structured fp16 sparsity vs cuBLAS fp16 dense GEMMs on H100. I have ensured that I am benchmarking GPU time through CUDA events and that I am on the latest cuSPARSELt version. https://github.com/OrenLeung/CUDALibrarySamples/commit/e3cfb07e6b6625ec33b8526d82bebd5a21185624
There are several things to consider when benchmarking cuSPARSELt. First, you should use Nsight Systems (or CUPTI) to get more reliable time measurements. Second, you need to run the autotuning functionality; see the other example. Other points to consider: run some warm-up iterations, lock the GPU SM/memory clocks, disable autoboost, ensure there is no power/thermal throttling, disable CPU turbo boost, set the CPU governor to performance, etc.
hi fbusato,
thanks for your suggestion.
cusparseLtMatmulSearch. Is there another function that I am missing? https://github.com/OrenLeung/CUDALibrarySamples/blob/e3cfb07e6b6625ec33b8526d82bebd5a21185624/cuSPARSELt/matmul/matmul_example.cpp#L348
sudo nvidia-smi -i 0 --lock-gpu-clocks=1830,1830
It seems that when the inputs are changed to a normal distribution centered around 0, the sparse performance gets a bit better, with a 20% improvement over dense. https://github.com/OrenLeung/CUDALibrarySamples/commit/9cabba4b1154f2c49037d89171d41c31b6033c79
# median of 5000 iterations with removing first 100 iterations
Dense Median: 642.971 TFLOP/s
Sparse Median: 768.348 TFLOP/s
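The statistic reported above (a median over many iterations with the first ones discarded as warm-up) can be sketched as follows. This is a minimal illustration, not the code from the linked commit; the function name and signature are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Median of per-iteration measurements after dropping the first `warmup` samples.
// Takes the vector by value so the caller's data is left untouched.
double median_after_warmup(std::vector<double> samples, std::size_t warmup) {
    if (samples.size() <= warmup) {
        throw std::invalid_argument("need more samples than warm-up iterations");
    }
    samples.erase(samples.begin(),
                  samples.begin() + static_cast<std::ptrdiff_t>(warmup));

    // Partially sort so that the element at `mid` is in its sorted position.
    auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    const double upper = *mid;
    if (samples.size() % 2 == 1) {
        return upper;
    }
    // Even count: average the two middle values.
    const double lower = *std::max_element(samples.begin(), mid);
    return 0.5 * (lower + upper);
}
```

Applied to the medians above, the sparse-to-dense ratio is 768.348 / 642.971, or roughly 1.195, consistent with the 20% improvement mentioned in the comment.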
@OrenLeung we evaluated the same sparse GEMM operation on our systems, default clocks. We observed 1.38x speedup (sparse vs. dense) on a H100 350W and 1.22x on H100 800W.
@fbusato thanks for running it. by "800W h100", you mean 700W right? we also see around 1.20-1.22x improvement too.
Would you have any suggestions on shapes where sparsity would show the biggest gain compared to dense?
I don't have any specific suggestions other than to try different shapes and data types. The results are affected by different GPU models, clock settings, and cuda version, so it is hard to give exact sizes. The main engineer is OOTO, and he will be back in 2w. He can help you better
on https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul
the example runs fine on the existing small m,n,k, but unfortunately when I change m,n,k to 8192, I get a runtime error. Any pointers or patches on how to fix it?
CUSPARSE API failed at line 191 with error: operation not supported (10)
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSELt/matmul/matmul_example.cpp#L116-L118