NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
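For context, the quickest way to see that Python API in action is a sketch like the following, assuming the high-level LLM entry point available in recent releases (the model name is illustrative and the exact API surface may differ across versions):

```python
# Minimal TensorRT-LLM usage sketch (high-level LLM API; illustrative only).
# Assumes tensorrt_llm is installed with the LLM API available.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # engine is built on first load
params = SamplingParams(max_tokens=32)

for output in llm.generate(["What is 2:4 structured sparsity?"], params):
    print(output.outputs[0].text)
```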

On the missing latency reduction with "--weight_sparsity" and how sparsity gets along w. quantization #1802

Closed: aiiAtelier closed this issue 1 week ago

aiiAtelier commented 1 week ago

System Info

GPU: A10 and H100
TensorRT-LLM: 0.9.0

Who can help?

@Tracin @kaiyux @byshiue

Reproduction

Using the following commands to convert the checkpoint and build the engine with and without --weight_sparsity, I'm getting the same latency numbers:

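# convert the checkpoint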
python convert_checkpoint.py --model_dir <path-to-model-7b-llama> --output_dir <path-to-model-7b-llama-ckpt> --dtype float16

# sparse engine and benchmark
trtllm-build --checkpoint_dir <path-to-model-7b-llama-ckpt>  --output_dir <path-to-model-7b-llama-trt-sparse-engine>  --weight_sparsity

python <path-to-TensorRT-LLM>/benchmarks/python/benchmark.py  -m llama_7b  --mode plugin  --batch_size "1"  --input_output_len "128,128"  --engine_dir <path-to-model-7b-llama-trt-sparse-engine>

# dense engine and benchmark
trtllm-build --checkpoint_dir <path-to-model-7b-llama-ckpt>  --output_dir <path-to-model-7b-llama-trt-dense-engine> 

python <path-to-TensorRT-LLM>/benchmarks/python/benchmark.py  -m llama_7b  --mode plugin  --batch_size "1"  --input_output_len "128,128"  --engine_dir <path-to-model-7b-llama-trt-dense-engine>
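For reference, the sparse tensor cores on Ampere/Hopper only accelerate GEMMs whose weights already follow the 2:4 structured-sparsity pattern (at most 2 nonzeros in every group of 4), so a dense checkpoint may see no benefit from --weight_sparsity. Below is a quick sketch for checking a checkpoint; the is_2to4_sparse helper is my own illustration, not part of TensorRT-LLM:

```python
# Illustrative check: does a weight tensor follow the 2:4 structured-sparsity
# pattern (<= 2 nonzeros in every consecutive group of 4 along the last dim)?
import torch
from transformers import AutoModelForCausalLM

def is_2to4_sparse(w: torch.Tensor) -> bool:
    cols = w.shape[-1] - w.shape[-1] % 4      # drop a ragged tail, if any
    groups = w[..., :cols].reshape(-1, 4)     # consecutive groups of 4
    return bool(((groups != 0).sum(dim=1) <= 2).all())

# Scan the Llama linear-layer weights (path is the same placeholder as above).
model = AutoModelForCausalLM.from_pretrained("<path-to-model-7b-llama>")
for name, p in model.named_parameters():
    if p.dim() == 2:
        print(name, is_2to4_sparse(p.data))
```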

Expected behavior

# Dense
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 50000 precision float16 batch_size 1 input_length 128 output_length 128 gpu_peak_mem(gb) 13.609 build_time(s) 0 tokens_per_sec 34.97 percentile95(ms) 3661.848 percentile99(ms) 3661.848 latency(ms) 3659.924 compute_cap sm86 quantization QuantMode.0 generation_time(ms) 3615.029 total_generated_tokens 127.0 generation_tokens_per_second 35.131

# Sparse
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 50000 precision float16 batch_size 1 input_length 128 output_length 128 gpu_peak_mem(gb) 13.609 build_time(s) 0 tokens_per_sec 34.97 percentile95(ms) 3661.848 percentile99(ms) 3661.848 latency(ms) 3659.924 compute_cap sm86 quantization QuantMode.0 generation_time(ms) **2815.029** total_generated_tokens 127.0 generation_tokens_per_second **45.115**
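(In these logs generation_tokens_per_second is simply total_generated_tokens / generation_time, so 127 tokens in 2.815 s comes out to roughly 45.1 tokens/s.)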

Actual behavior

# Dense
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 50000 precision float16 batch_size 1 input_length 128 output_length 128 gpu_peak_mem(gb) 13.609 build_time(s) 0 tokens_per_sec 34.97 percentile95(ms) 3661.848 percentile99(ms) 3661.848 latency(ms) 3659.924 compute_cap sm86 quantization QuantMode.0 generation_time(ms) 3615.029 total_generated_tokens 127.0 generation_tokens_per_second 35.131

# Sparse
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 50000 precision float16 batch_size 1 input_length 128 output_length 128 gpu_peak_mem(gb) 13.609 build_time(s) 0 tokens_per_sec 35.05 percentile95(ms) 3653.556 percentile99(ms) 3653.556 latency(ms) 3652.162 compute_cap sm86 quantization QuantMode.0 generation_time(ms) 3610.4 total_generated_tokens 127.0 generation_tokens_per_second 35.176

Additional notes

I'm also wondering whether "--weight_sparsity" works together with INT8 weight/activation quantization and with FP8 weight/activation quantization (on H100). Thanks.

hijkzzz commented 1 week ago

From a TRT-LLM engineer: I have recently been testing the sparsity performance of TRT-LLM. Indeed, for both BF16 and FP8 the acceleration ratio is within 5% (GPT3-843M).

Related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1731