NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Question about weight_sparsity #1559

Closed: hcy682 closed this issue 6 months ago

hcy682 commented 6 months ago

Environment

Reproduction Steps

I want to use TensorRT-LLM to deploy a llama2 model with semi-structured sparsity (2:4) to get inference acceleration, so I used Wanda to apply semi-structured (2:4) pruning to a llama2 model.
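
For context, a minimal sketch of what producing the 2:4 pattern on a single weight matrix can look like. It uses a plain magnitude criterion as a stand-in for Wanda's activation-aware metric, so it only illustrates the sparsity pattern, not Wanda itself; `prune_2_to_4` is an illustrative helper, not part of Wanda or TensorRT-LLM:

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude values in every group of 4 along the last dim."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # keep the 2 largest-magnitude entries in each group of 4, zero the rest
    keep_idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_pruned = prune_2_to_4(w)  # every row group of 4 now has at least two zeros
```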

Then I convert the model into the TensorRT-LLM checkpoint format: python convert_checkpoint.py --model_dir /mnt/disk1/hcy/wanda/results/wanda/llama2-7b-semi --output_dir tllm_checkpoint/llama2-7b-wanda-2:4 --dtype float16

When compiling the model, I set --weight_sparsity to enable the sparsity feature: trtllm-build --checkpoint_dir tllm_checkpoint/llama2-7b-wanda-2:4 --output_dir trt_engines/llama2-7b-wanda-2:4 --gemm_plugin float16 --weight_sparsity

Actual Behavior

The TensorRT logs don't report any information about sparsity.


And when I use benchmark.py to test the pruned model, there is no acceleration:

python /mnt/disk1/hcy/TensorRT-LLM/benchmarks/python/benchmark.py -m llama_7b --mode plugin --batch_size "1" --input_output_len "128,128" --engine_dir trt_engines/llama2-7b-wanda-2:4

Model with semi-structured sparsity (2:4): generation_tokens_per_second 52.829


Original model without sparsity: generation_tokens_per_second 52.386


It seems that --weight_sparsity doesn't take effect. How can I get acceleration from semi-structured sparsity using TensorRT-LLM? Thanks!

Additional Notes

I have confirmed that the model has semi-structured sparsity (among each group of four contiguous values, at least two are zero) by printing some of the weights.
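
For reference, this check can be scripted roughly as follows (`is_2_to_4_sparse` is an assumed helper, not part of TensorRT-LLM or Wanda):

```python
import torch

def is_2_to_4_sparse(weight: torch.Tensor) -> bool:
    """True if every group of 4 contiguous values along the last dim has >= 2 zeros."""
    assert weight.shape[-1] % 4 == 0
    groups = weight.reshape(-1, 4)
    return bool(((groups == 0).sum(dim=-1) >= 2).all())

# e.g. loop over the 2-D weights of the pruned Hugging Face model:
# for name, p in model.named_parameters():
#     if p.dim() == 2:
#         print(name, is_2_to_4_sparse(p))
```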

byshiue commented 6 months ago

Please remove --gemm_plugin float16, because sparsity is not supported by the gemm plugin.
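
For reference, rebuilding with the same command as above minus that flag would look something like: trtllm-build --checkpoint_dir tllm_checkpoint/llama2-7b-wanda-2:4 --output_dir trt_engines/llama2-7b-wanda-2:4 --weight_sparsity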

hcy682 commented 6 months ago

> Please remove --gemm_plugin float16, because sparsity is not supported by the gemm plugin.

Thanks!😊 It works!

wzhuang-xmu commented 6 months ago

@hcy682 Hi, I'm also trying to use TensorRT-LLM to speed up sparse llama. How much speedup do you get for the llama2-7b model with semi-structured sparsity (2:4)? It would be greatly appreciated if you could tell me.

hcy682 commented 6 months ago

> @hcy682 Hi, I'm also trying to use TensorRT-LLM to speed up sparse llama. How much speedup do you get for the llama2-7b model with semi-structured sparsity (2:4)? It would be greatly appreciated if you could tell me.

original llama2 model: [benchmark screenshot]

unstructured pruning (50%): [benchmark screenshot]

semi-structured pruning (2:4): [benchmark screenshot]

wzhuang-xmu commented 6 months ago

> @hcy682 Hi, I'm also trying to use TensorRT-LLM to speed up sparse llama. How much speedup do you get for the llama2-7b model with semi-structured sparsity (2:4)? It would be greatly appreciated if you could tell me.
>
> original llama2 model: [benchmark screenshot]
>
> unstructured pruning (50%): [benchmark screenshot]
>
> semi-structured pruning (2:4): [benchmark screenshot]

Thank you so much!

aiiAtelier commented 5 months ago

> @hcy682 Hi, I'm also trying to use TensorRT-LLM to speed up sparse llama. How much speedup do you get for the llama2-7b model with semi-structured sparsity (2:4)? It would be greatly appreciated if you could tell me.
>
> original llama2 model: [benchmark screenshot] unstructured pruning (50%): [benchmark screenshot] semi-structured pruning (2:4): [benchmark screenshot]
>
> Thank you so much!

Do you see a similar latency reduction under INT8 and FP8 modes?

aiiAtelier commented 5 months ago

Is it just about --gemm_plugin float16, or any gemm plugin? For instance, when INT8 matmul or FP8 matmul is enabled, will sparsity still help?

> Please remove --gemm_plugin float16, because sparsity is not supported by the gemm plugin.
>
> Thanks!😊 It works!

aiiAtelier commented 5 months ago

Another question is on the direction of 2:4 sparsity: for example, my weights are something like the following. Does it work as-is, or does it have to be transposed before pruning? Thanks.

weight-check tensor([[  0, -21,   0, -35],
        [  0, -46,   0,  29],
        [-22,   0,  48,   0],
        [  0,   0, -46,  33]], dtype=torch.int8)

hcy682 commented 5 months ago

> Another question is on the direction of 2:4 sparsity: for example, my weights are something like the following. Does it work as-is, or does it have to be transposed before pruning? Thanks.
>
> weight-check tensor([[  0, -21,   0, -35],
>         [  0, -46,   0,  29],
>         [-22,   0,  48,   0],
>         [  0,   0, -46,  33]], dtype=torch.int8)

As long as your weights have semi-structured sparsity (for example 2:4, i.e. at least two zeros in every four contiguous values), you can get acceleration by setting --weight_sparsity when compiling the model. I haven't tried FP8 mode. I have tried INT8 mode, and it is faster than semi-structured sparsity.
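
As a concrete illustration (this only inspects the example tensor from the question above; it is not a statement about which layout TensorRT-LLM expects internally), the 4x4 weight satisfies 2:4 along its rows but not along its columns; `check_2_to_4` is just an illustrative helper:

```python
import torch

w = torch.tensor([[  0, -21,   0, -35],
                  [  0, -46,   0,  29],
                  [-22,   0,  48,   0],
                  [  0,   0, -46,  33]], dtype=torch.int8)

def check_2_to_4(t: torch.Tensor) -> bool:
    # at least two zeros in every group of 4 contiguous values along the last dim
    groups = t.reshape(-1, 4)
    return bool(((groups == 0).sum(dim=-1) >= 2).all())

print("2:4 along rows:   ", check_2_to_4(w))      # True: every row has >= 2 zeros
print("2:4 along columns:", check_2_to_4(w.t()))  # False: the last column has only one zero
```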