Closed. aiiAtelier closed this 1 week ago.
From a TRT-LLM engineer: Recently I have been testing the sparse performance of TRT-LLM. Indeed, both BF16 and FP8 show acceleration ratios within 5% (GPT3-843M).
Related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1731
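One likely reason the speedup stays within 5%: sparse Tensor Cores on Ampere/Hopper only accelerate GEMMs whose weights follow the 2:4 semi-structured pattern (at most 2 non-zeros in every contiguous group of 4). A quick way to sanity-check whether a weight matrix actually has that pattern is sketched below; `is_2to4_sparse` is an illustrative helper, not part of TRT-LLM.

```python
import numpy as np

def is_2to4_sparse(w: np.ndarray) -> bool:
    """Return True if every contiguous group of 4 weights along the last
    axis has at most 2 non-zero entries (the 2:4 pattern that sparse
    Tensor Cores require for a real GEMM speedup)."""
    if w.shape[-1] % 4 != 0:
        return False
    groups = w.reshape(-1, 4)
    return bool(np.all(np.count_nonzero(groups, axis=1) <= 2))

# A dense random matrix essentially never satisfies the pattern...
dense = np.random.randn(8, 16)

# ...while magnitude-pruning each group of 4 down to its top-2 entries does.
pruned = dense.reshape(-1, 4).copy()
smallest = np.argsort(np.abs(pruned), axis=1)[:, :2]  # two smallest per group
np.put_along_axis(pruned, smallest, 0.0, axis=1)
pruned = pruned.reshape(dense.shape)

print(is_2to4_sparse(dense), is_2to4_sparse(pruned))  # → False True
```

If the checkpoint's weights are dense (un-pruned), `--weight_sparsity` cannot deliver the theoretical 2x GEMM throughput, which would be consistent with near-identical latency numbers.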
System Info
GPU: A10 and H100
TensorRT-LLM: 0.9.0
Who can help?
@Tracin @kaiyux @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Using the following commands to convert the checkpoint and build the engine with or without --weight_sparsity, I'm getting the same latency numbers:

python convert_checkpoint.py --model_dir <path-to-model-7b-llama> --output_dir <path-to-model-7b-llama-ckpt> --dtype float16
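The build step that actually toggles sparsity is not shown above. Under TRT-LLM 0.9 it would look roughly like the sketch below: build the same checkpoint twice, once dense and once with --weight_sparsity, then benchmark both engines. The output directory names and plugin dtype are placeholders, and the exact flag spelling should be checked against the installed version's trtllm-build --help.

```shell
# Hypothetical sketch: build a dense and a sparse engine from the same
# checkpoint, then compare their latency. Paths are placeholders.
trtllm-build --checkpoint_dir <path-to-model-7b-llama-ckpt> \
             --output_dir engine_dense \
             --gemm_plugin float16

trtllm-build --checkpoint_dir <path-to-model-7b-llama-ckpt> \
             --output_dir engine_sparse \
             --gemm_plugin float16 \
             --weight_sparsity
```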
Expected behavior
Actual behavior
Additional notes
I'm also wondering whether --weight_sparsity composes well with INT8 weight/activation and FP8 weight/activation quantization (on H100). Thanks.