NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

baichuan2 7b. int4 slower than int8 #244

Open hljjjmssyh opened 12 months ago

hljjjmssyh commented 12 months ago

On an NVIDIA A100, for the same request, the INT8 model takes ~200 ms but the INT4 model takes 2.4 s.

jdemouth-nvidia commented 12 months ago

Hi @hljjjmssyh ,

Can you share more details, please? For example, the command lines used to build and run the models?

Thanks, Julien

hljjjmssyh commented 12 months ago

@jdemouth-nvidia Here is the command line used to build the INT4 engine:

python build.py --model_version v2_7b \
                --model_dir baichuan2-7b \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4 \
                --output_dir ./tmp/baichuan_v2_7b/trt_engines/int4_weight_only/1-gpu/

Baichuan-int8, Baichuan-int4, and Baichuan2-int8 seem to work well, but Baichuan2-int4 is very slow.
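
For comparison, a minimal sketch of what the corresponding INT8 build would look like, assuming it differs from the INT4 command only in the weight-only precision flag and the output directory (the reporter's actual INT8 command was not posted):

python build.py --model_version v2_7b \
                --model_dir baichuan2-7b \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --weight_only_precision int8 \
                --output_dir ./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/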

juney-nvidia commented 12 months ago

@hljjjmssyh

Hi, can you share the full build/run commands for both the INT8 and INT4 workflows in your environment? That would make it easier for us to reproduce the issue.

Thanks, June
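
As an illustration of the kind of run command being requested, here is a minimal sketch based on the Baichuan example's run.py; the argument names, tokenizer directory, and engine path below are assumptions, not the reporter's actual command:

python run.py --model_version v2_7b \
              --max_output_len 512 \
              --tokenizer_dir baichuan2-7b \
              --engine_dir ./tmp/baichuan_v2_7b/trt_engines/int4_weight_only/1-gpu/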

byshiue commented 10 months ago

Any update?