hcy682 closed this issue 6 months ago
Please remove the --gemm_plugin float16, because sparsity is not supported by the gemm plugin.
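Concretely, the build command from the reproduction steps below would become (same flags, just without the gemm plugin):

trtllm-build --checkpoint_dir tllm_checkpoint/llama2-7b-wanda-2:4 --output_dir trt_engines/llama2-7b-wanda-2:4 --weight_sparsity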
Thanks!😊 It works!
@hcy682 hi, I'm also trying to use tensorrt-llm to speed up sparse llama. How much can llama2-7b model with semi-structured sparsity(2:4) be accelerated? Can you tell me? It would be greatly appreciated if you could.
[Benchmark results were posted as screenshots, not preserved in this text, for: original llama2 model; unstructured pruning (50%); semi-structured pruning (2:4).]
Thank you so much!
Do you see a similar latency reduction under INT8 and FP8 modes?
Is it just about "gemm_plugin float16", or any gemm_plugin? For instance, when INT8 matmul or FP8 matmul is enabled, will sparsity still help?
Another question is on the direction of 2:4 sparsity: for example, my weights are something like the following. Does it work as-is, or does it have to be transposed before pruning? Thanks.
weight-check tensor([[ 0, -21, 0, -35],
[ 0, -46, 0, 29],
[-22, 0, 48, 0],
[ 0, 0, -46, 33]], dtype=torch.int8)
As long as your weights have semi-structured sparsity (for example 2:4, i.e. at least two zeros in every four contiguous values), you can get acceleration by setting --weight_sparsity when compiling the model. I haven't tried FP8 mode. I have tried INT8 mode, and it is faster than semi-structured sparsity.
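To make the direction concrete, here is a minimal sketch (my own check, assuming the 2:4 groups run along the last dimension of the stored weight, i.e. along each row) that tests a tensor like the one above:

import torch

def check_2to4(weight: torch.Tensor) -> bool:
    # 2:4 semi-structured sparsity: every contiguous group of four values
    # along the last dimension must contain at least two zeros.
    rows, cols = weight.shape
    assert cols % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)
    return bool(((groups == 0).sum(dim=-1) >= 2).all())

w = torch.tensor([[  0, -21,   0, -35],
                  [  0, -46,   0,  29],
                  [-22,   0,  48,   0],
                  [  0,   0, -46,  33]], dtype=torch.int8)
print(check_2to4(w))      # True: every row has two zeros per group of four
print(check_2to4(w.t()))  # False here: column 4 of w has only one zero

Whether TensorRT-LLM expects the pattern along rows or columns of the stored weight is an assumption in this sketch; it is worth verifying against how trtllm-build consumes the checkpoint.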
Environment
Reproduction Steps
I want to use TensorRT-LLM to deploy a llama2 model with semi-structured sparsity (2:4) to get inference acceleration, so I used Wanda to apply semi-structured (2:4) pruning to a llama2 model.
Then I converted the model into TensorRT-LLM checkpoint format:
python convert_checkpoint.py --model_dir /mnt/disk1/hcy/wanda/results/wanda/llama2-7b-semi --output_dir tllm_checkpoint/llama2-7b-wanda-2:4 --dtype float16
When compiling the model, I set --weight_sparsity to enable the sparsity feature:
trtllm-build --checkpoint_dir tllm_checkpoint/llama2-7b-wanda-2:4 --output_dir trt_engines/llama2-7b-wanda-2:4 --gemm_plugin float16 --weight_sparsity
Actual Behavior
The TensorRT logs don't report any information about sparsity.
And when I use benchmark.py to test the pruned model, there is no acceleration:
python /mnt/disk1/hcy/TensorRT-LLM/benchmarks/python/benchmark.py -m llama_7b --mode plugin --batch_size "1" --input_output_len "128,128" --engine_dir trt_engines/llama2-7b-wanda-2:4
model with semi-structured sparsity (2:4): generation_tokens_per_second 52.829
original model without sparsity: generation_tokens_per_second 52.386
It seems that --weight_sparsity doesn't work. So how can I get acceleration from semi-structured sparsity using TensorRT-LLM? Thanks!
Additional Notes
I have confirmed that the model has semi-structured sparsity (among each group of four contiguous values, at least two are zero) by outputting some weights.
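For reference, a minimal sketch of how such a check could be scripted over the whole model (assuming the Wanda output is a standard Hugging Face checkpoint at the path from the conversion step; this is my own illustration, not the exact script used):

import torch
from transformers import AutoModelForCausalLM

# Path taken from the convert_checkpoint.py step above.
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/disk1/hcy/wanda/results/wanda/llama2-7b-semi",
    torch_dtype=torch.float16,
)

for name, module in model.named_modules():
    # Wanda prunes the linear projections inside the transformer blocks;
    # lm_head is typically left dense, so it is skipped here.
    if isinstance(module, torch.nn.Linear) and "lm_head" not in name:
        groups = module.weight.detach().reshape(-1, 4)  # groups of 4 along the input dim
        ok = ((groups == 0).sum(dim=-1) >= 2).all()
        print(f"{name}: {'2:4 OK' if ok else '2:4 violated'}")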