NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
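For context, the quickest way to see that Python API in action is a sketch like the following, assuming the high-level LLM entry point available in recent releases (the model name is illustrative and the exact API surface may differ across versions):

```python
# Minimal TensorRT-LLM usage sketch (high-level LLM API; illustrative only).
# Assumes tensorrt_llm is installed with the LLM API available.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # engine is built on first load
params = SamplingParams(max_tokens=32)

for output in llm.generate(["What is 2:4 structured sparsity?"], params):
    print(output.outputs[0].text)
```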

On the missing latency reduction with "--weight_sparsity" and how sparsity gets along w. quantization #1802

Closed: aiiAtelier closed this issue 1 week ago

aiiAtelier commented 1 week ago

System Info

GPU: A10 and H100
TensorRT-LLM: 0.9.0

Who can help?

@Tracin @kaiyux @byshiue

Reproduction

Using the following commands to convert the checkpoint and build the engine with and without --weight_sparsity, I'm getting the same latency numbers:

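# convert the checkpoint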
python convert_checkpoint.py --model_dir <path-to-model-7b-llama> --output_dir <path-to-model-7b-llama-ckpt> --dtype float16

# sparse engine and benchmark
trtllm-build --checkpoint_dir <path-to-model-7b-llama-ckpt>  --output_dir <path-to-model-7b-llama-trt-sparse-engine>  --weight_sparsity

python <path-to-TensorRT-LLM>/benchmarks/python/benchmark.py  -m llama_7b  --mode plugin  --batch_size "1"  --input_output_len "128,128"  --engine_dir <path-to-model-7b-llama-trt-sparse-engine>

# dense engine and benchmark
trtllm-build --checkpoint_dir <path-to-model-7b-llama-ckpt>  --output_dir <path-to-model-7b-llama-trt-dense-engine> 

python <path-to-TensorRT-LLM>/benchmarks/python/benchmark.py  -m llama_7b  --mode plugin  --batch_size "1"  --input_output_len "128,128"  --engine_dir <path-to-model-7b-llama-trt-dense-engine>
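For reference, the sparse tensor cores on Ampere/Hopper only accelerate GEMMs whose weights already follow the 2:4 structured-sparsity pattern (at most 2 nonzeros in every group of 4), so a dense checkpoint may see no benefit from --weight_sparsity. Below is a quick sketch for checking a checkpoint; the is_2to4_sparse helper is my own illustration, not part of TensorRT-LLM:

```python
# Illustrative check: does a weight tensor follow the 2:4 structured-sparsity
# pattern (<= 2 nonzeros in every consecutive group of 4 along the last dim)?
import torch
from transformers import AutoModelForCausalLM

def is_2to4_sparse(w: torch.Tensor) -> bool:
    cols = w.shape[-1] - w.shape[-1] % 4      # drop a ragged tail, if any
    groups = w[..., :cols].reshape(-1, 4)     # consecutive groups of 4
    return bool(((groups != 0).sum(dim=1) <= 2).all())

# Scan the Llama linear-layer weights (path is the same placeholder as above).
model = AutoModelForCausalLM.from_pretrained("<path-to-model-7b-llama>")
for name, p in model.named_parameters():
    if p.dim() == 2:
        print(name, is_2to4_sparse(p.data))
```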

Expected behavior

# Dense
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 50000 precision float16 batch_size 1 input_length 128 output_length 128 gpu_peak_mem(gb) 13.609 build_time(s) 0 tokens_per_sec 34.97 percentile95(ms) 3661.848 percentile99(ms) 3661.848 latency(ms) 3659.924 compute_cap sm86 quantization QuantMode.0 generation_time(ms) 3615.029 total_generated_tokens 127.0 generation_tokens_per_second 35.131

# Sparse
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 50000 precision float16 batch_size 1 input_length 128 output_length 128 gpu_peak_mem(gb) 13.609 build_time(s) 0 tokens_per_sec 34.97 percentile95(ms) 3661.848 percentile99(ms) 3661.848 latency(ms) 3659.924 compute_cap sm86 quantization QuantMode.0 generation_time(ms) **2815.029** total_generated_tokens 127.0 generation_tokens_per_second **45.115**
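(In these logs generation_tokens_per_second is simply total_generated_tokens / generation_time, so 127 tokens in 2.815 s comes out to roughly 45.1 tokens/s.)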

Actual behavior

# Dense
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 50000 precision float16 batch_size 1 input_length 128 output_length 128 gpu_peak_mem(gb) 13.609 build_time(s) 0 tokens_per_sec 34.97 percentile95(ms) 3661.848 percentile99(ms) 3661.848 latency(ms) 3659.924 compute_cap sm86 quantization QuantMode.0 generation_time(ms) 3615.029 total_generated_tokens 127.0 generation_tokens_per_second 35.131

# Sparse
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 50000 precision float16 batch_size 1 input_length 128 output_length 128 gpu_peak_mem(gb) 13.609 build_time(s) 0 tokens_per_sec 35.05 percentile95(ms) 3653.556 percentile99(ms) 3653.556 latency(ms) 3652.162 compute_cap sm86 quantization QuantMode.0 generation_time(ms) 3610.4 total_generated_tokens 127.0 generation_tokens_per_second 35.176

Additional notes

I'm also wondering whether "--weight_sparsity" works together with INT8 weight/activation quantization and with FP8 weight/activation quantization (on H100). Thanks.

hijkzzz commented 1 week ago

From a TRT-LLM engineer: I have recently been testing the sparsity performance of TRT-LLM. Indeed, for both BF16 and FP8 the acceleration ratio is within 5% (GPT3-843M).

Related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1731