IST-DASLab / sparsegpt

Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
https://arxiv.org/abs/2301.00774
Apache License 2.0

How should I verify the speedup effect of the algorithm? #15

Open · moonlightian opened this issue 1 year ago

moonlightian commented 1 year ago

As shown in the paper, the CUTLASS library is used for the speedup measurements, but I did not find any code in this repository that relies on it. How should I verify that SparseGPT is faster than the dense model at inference time? Even if the end-to-end speedups are somewhat lower, that would be fine. Thanks a lot for your great work~

efrantar commented 1 year ago

Hi, SparseGPT itself is only concerned with accurately sparsifying a model; acceleration comes from other software/hardware that can exploit sparsity for speedups (such as 2:4 sparsity on Ampere GPUs). Our layer-wise 2:4 speedup measurements were produced directly with the prebuilt kernels available in NVIDIA's CUTLASS profiler: we compiled all the available kernels and then ran a benchmark sweep with this profiler (on an A100 GPU) over FP16/FP16 SpGEMMs of the appropriate matrix shapes. The results of this sweep are the numbers we report. Observing those speedups during full inference will require integrating the corresponding CUTLASS kernels into PyTorch. (Though I think PyTorch is actually working on an official NVIDIA 2:4 integration, so hopefully running 2:4 models will be quite easy very soon.)
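
For illustration, a minimal layer-wise benchmark along these lines can now be sketched with the 2:4 ("semi-structured") sparsity prototype that later landed in PyTorch (`torch.sparse.to_sparse_semi_structured`, available from roughly PyTorch 2.1 as a prototype API). This is an assumption about that integration, not the CUTLASS-profiler setup used for the paper's numbers:

```python
# Hedged sketch: compare a 2:4-sparse FP16 matmul against its dense
# counterpart using PyTorch's prototype semi-structured sparsity support.
# Requires an Ampere-or-newer GPU (e.g. A100); shapes are illustrative.
import torch
from torch.sparse import to_sparse_semi_structured

m, k, n = 4096, 4096, 2048  # example transformer projection shape

# Build a weight that already satisfies the 2:4 pattern
# (two nonzeros in every contiguous group of four along a row).
W = torch.randn(m, k, dtype=torch.float16, device="cuda")
W *= torch.tensor([1, 1, 0, 0], dtype=torch.float16, device="cuda").repeat(m, k // 4)
X = torch.randn(k, n, dtype=torch.float16, device="cuda")

W_sp = to_sparse_semi_structured(W)  # compress to 2:4 storage (values + metadata)

def bench_ms(fn, warmup=10, iters=100):
    # Time with CUDA events so host-side overhead is not counted.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

dense_ms = bench_ms(lambda: torch.mm(W, X))
sparse_ms = bench_ms(lambda: torch.mm(W_sp, X))
print(f"dense: {dense_ms:.3f} ms | 2:4 sparse: {sparse_ms:.3f} ms | "
      f"speedup: {dense_ms / sparse_ms:.2f}x")
```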

moonlightian commented 1 year ago

Thank you for your kind reply~

moonlightian commented 1 year ago

@efrantar Hi, following your explanation, I prepared an environment for NVIDIA's CUTLASS profiler and compiled the kernels following the official guide. Regarding "Observing those speedups during full inference will require integrating the corresponding CUTLASS kernels into PyTorch" mentioned above, I'm confused about how to make that work. Would it be convenient for you to share some code for speedup testing? Links to related NVIDIA demos would be fine too. Thanks again
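
For concreteness, the profiler sweep the maintainer describes can be scripted around the `cutlass_profiler` binary. A hedged sketch follows: the `--kernels`/`--m`/`--n`/`--k` flags are part of the profiler's documented CLI, but the binary path and the kernel-name filters are assumptions that vary across CUTLASS versions and build layouts (check `cutlass_profiler --help` and the enumerated kernel names):

```python
# Hedged sketch: sweep dense vs. 2:4-sparse FP16 tensor-op GEMMs over a few
# transformer-layer shapes with the CUTLASS profiler. The kernel-name filters
# ("h16816gemm", "spgemm") are assumptions and may differ by CUTLASS version.
import subprocess

PROFILER = "./tools/profiler/cutlass_profiler"  # hypothetical build path

# (out_features, in_features) of typical projection layers; n = number of tokens.
shapes = [(4096, 4096), (16384, 4096), (4096, 16384)]
tokens = 2048

for m, k in shapes:
    for kernel_filter in ("h16816gemm", "spgemm"):  # dense vs. 2:4 sparse
        cmd = [
            PROFILER,
            f"--kernels={kernel_filter}",
            f"--m={m}", f"--n={tokens}", f"--k={k}",
        ]
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)
```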

kiucho commented 1 year ago

Hi, I'm also trying to validate the speedup of 2:4-sparsified models over dense models. As I understand it, to properly use SpMM (sparse matrix × dense matrix multiplication) on NVIDIA's Ampere-architecture GPUs (like the A6000 or A100), the cuSPARSELt library needs to be integrated into PyTorch, which I believe the PyTorch team is working on (the cuSPARSELt integration). I have a few questions about this.

  1. Does SparseGPT use the CUTLASS library only for speedup measurement, or does it also use it as a stand-in for cuSPARSELt to perform SpMM? (See the backend sketch after this comment.)

  2. Finally, integrating a profiler into PyTorch seems to be a complex task that requires a deep understanding of both the PyTorch framework and the profiler. I would also be grateful if I could get the profiler setup and code used for the speedup measurements.

I look forward to hearing from you. Thank you.
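
On question 1: per the maintainer's reply above, CUTLASS was used only for the layer-wise speedup measurements; SparseGPT's pruning code itself performs no sparse matmuls. Regarding the CUTLASS-versus-cuSPARSELt choice, PyTorch's later semi-structured prototype can dispatch to either backend. A hedged sketch below; `_FORCE_CUTLASS` is a private attribute of the prototype and may have changed or been removed in newer releases:

```python
# Hedged sketch: choosing between the CUTLASS and cuSPARSELt backends in
# PyTorch's prototype 2:4 ("semi-structured") sparsity support (>= ~2.1).
# _FORCE_CUTLASS is a private prototype toggle; treat it as an assumption.
import torch
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured

# True forces the CUTLASS-based kernels; leave it False to let PyTorch use
# cuSPARSELt where the build ships with it.
SparseSemiStructuredTensor._FORCE_CUTLASS = True

W = torch.tensor([0, 0, 1, 1], dtype=torch.float16, device="cuda").tile(128, 32)
W_sp = to_sparse_semi_structured(W)  # valid 2:4 pattern: two zeros per group of four
X = torch.randn(128, 128, dtype=torch.float16, device="cuda")
print(torch.mm(W_sp, X).shape)  # runs on the selected 2:4 backend
```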