NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Question about weight_sparsity #1559

Closed: hcy682 closed this issue 6 months ago

hcy682 commented 6 months ago

Environment

Reproduction Steps

I want to use TensorRT-LLM to deploy a llama2 model with semi-structured sparsity (2:4) to get inference acceleration, so I used Wanda to apply semi-structured (2:4) pruning to a llama2 model.
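
For context, a minimal sketch of what producing the 2:4 pattern on a single weight matrix can look like. It uses a plain magnitude criterion as a stand-in for Wanda's activation-aware metric, so it only illustrates the sparsity pattern, not Wanda itself; `prune_2_to_4` is an illustrative helper, not part of Wanda or TensorRT-LLM:

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude values in every group of 4 along the last dim."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # keep the 2 largest-magnitude entries in each group of 4, zero the rest
    keep_idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_pruned = prune_2_to_4(w)  # every row group of 4 now has at least two zeros
```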

Then I convert the model into the TensorRT-LLM checkpoint format: python convert_checkpoint.py --model_dir /mnt/disk1/hcy/wanda/results/wanda/llama2-7b-semi --output_dir tllm_checkpoint/llama2-7b-wanda-2:4 --dtype float16

When compiling the model, I set --weight_sparsity to enable the sparsity feature: trtllm-build --checkpoint_dir tllm_checkpoint/llama2-7b-wanda-2:4 --output_dir trt_engines/llama2-7b-wanda-2:4 --gemm_plugin float16 --weight_sparsity

Actual Behavior

The TensorRT logs don't report any information about sparsity.


And when I use benchmark.py to test the pruned model, there is no acceleration:

python /mnt/disk1/hcy/TensorRT-LLM/benchmarks/python/benchmark.py -m llama_7b --mode plugin --batch_size "1" --input_output_len "128,128" --engine_dir trt_engines/llama2-7b-wanda-2:4

Model with semi-structured sparsity (2:4): generation_tokens_per_second 52.829


Original model without sparsity: generation_tokens_per_second 52.386


It seems that --weight_sparsity doesn't take effect. How can I get acceleration from semi-structured sparsity using TensorRT-LLM? Thanks!

Additional Notes

I have confirmed that the model has semi-structured sparsity (among each group of four contiguous values, at least two are zero) by printing some of the weights.
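
For reference, this check can be scripted roughly as follows (`is_2_to_4_sparse` is an assumed helper, not part of TensorRT-LLM or Wanda):

```python
import torch

def is_2_to_4_sparse(weight: torch.Tensor) -> bool:
    """True if every group of 4 contiguous values along the last dim has >= 2 zeros."""
    assert weight.shape[-1] % 4 == 0
    groups = weight.reshape(-1, 4)
    return bool(((groups == 0).sum(dim=-1) >= 2).all())

# e.g. loop over the 2-D weights of the pruned Hugging Face model:
# for name, p in model.named_parameters():
#     if p.dim() == 2:
#         print(name, is_2_to_4_sparse(p))
```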

byshiue commented 6 months ago

Please remove --gemm_plugin float16, because sparsity is not supported by the gemm plugin.
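
For reference, rebuilding with the same command as above minus that flag would look something like: trtllm-build --checkpoint_dir tllm_checkpoint/llama2-7b-wanda-2:4 --output_dir trt_engines/llama2-7b-wanda-2:4 --weight_sparsity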

hcy682 commented 6 months ago

> Please remove --gemm_plugin float16, because sparsity is not supported by the gemm plugin.

Thanks!😊 It works!

wzhuang-xmu commented 6 months ago

@hcy682 Hi, I'm also trying to use TensorRT-LLM to speed up sparse llama. How much speedup do you get for the llama2-7b model with semi-structured sparsity (2:4)? It would be greatly appreciated if you could tell me.

hcy682 commented 6 months ago

> @hcy682 Hi, I'm also trying to use TensorRT-LLM to speed up sparse llama. How much speedup do you get for the llama2-7b model with semi-structured sparsity (2:4)? It would be greatly appreciated if you could tell me.

original llama2 model: [benchmark screenshot]

unstructured pruning (50%): [benchmark screenshot]

semi-structured pruning (2:4): [benchmark screenshot]

wzhuang-xmu commented 6 months ago

> @hcy682 Hi, I'm also trying to use TensorRT-LLM to speed up sparse llama. How much speedup do you get for the llama2-7b model with semi-structured sparsity (2:4)? It would be greatly appreciated if you could tell me.
>
> original llama2 model: [benchmark screenshot]
>
> unstructured pruning (50%): [benchmark screenshot]
>
> semi-structured pruning (2:4): [benchmark screenshot]

Thank you so much!

aiiAtelier commented 5 months ago

> @hcy682 Hi, I'm also trying to use TensorRT-LLM to speed up sparse llama. How much speedup do you get for the llama2-7b model with semi-structured sparsity (2:4)? It would be greatly appreciated if you could tell me.
>
> original llama2 model: [benchmark screenshot] unstructured pruning (50%): [benchmark screenshot] semi-structured pruning (2:4): [benchmark screenshot]
>
> Thank you so much!

Do you see a similar latency reduction under INT8 and FP8 modes?

aiiAtelier commented 5 months ago

Is it just about --gemm_plugin float16, or any gemm plugin? For instance, when INT8 matmul or FP8 matmul is enabled, will sparsity still help?

> Please remove --gemm_plugin float16, because sparsity is not supported by the gemm plugin.
>
> Thanks!😊 It works!

aiiAtelier commented 5 months ago

Another question is on the direction of 2:4 sparsity: for example, my weights are something like the following. Does it work as-is, or does it have to be transposed before pruning? Thanks.

weight-check tensor([[  0, -21,   0, -35],
        [  0, -46,   0,  29],
        [-22,   0,  48,   0],
        [  0,   0, -46,  33]], dtype=torch.int8)

hcy682 commented 5 months ago

> Another question is on the direction of 2:4 sparsity: for example, my weights are something like the following. Does it work as-is, or does it have to be transposed before pruning? Thanks.
>
> weight-check tensor([[  0, -21,   0, -35],
>         [  0, -46,   0,  29],
>         [-22,   0,  48,   0],
>         [  0,   0, -46,  33]], dtype=torch.int8)

As long as your weights have semi-structured sparsity (for example 2:4, i.e. at least two zeros in every four contiguous values), you can get acceleration by setting --weight_sparsity when compiling the model. I haven't tried FP8 mode. I have tried INT8 mode, and it is faster than semi-structured sparsity.
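
As a concrete illustration (this only inspects the example tensor from the question above; it is not a statement about which layout TensorRT-LLM expects internally), the 4x4 weight satisfies 2:4 along its rows but not along its columns; `check_2_to_4` is just an illustrative helper:

```python
import torch

w = torch.tensor([[  0, -21,   0, -35],
                  [  0, -46,   0,  29],
                  [-22,   0,  48,   0],
                  [  0,   0, -46,  33]], dtype=torch.int8)

def check_2_to_4(t: torch.Tensor) -> bool:
    # at least two zeros in every group of 4 contiguous values along the last dim
    groups = t.reshape(-1, 4)
    return bool(((groups == 0).sum(dim=-1) >= 2).all())

print("2:4 along rows:   ", check_2_to_4(w))      # True: every row has >= 2 zeros
print("2:4 along columns:", check_2_to_4(w.t()))  # False: the last column has only one zero
```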