hljjjmssyh opened 12 months ago
Hi @hljjjmssyh ,
Can you share more details, please? For example, the command lines used to build and run the models?
Thanks, Julien
@jdemouth-nvidia Here is the build command:

```bash
python build.py --model_version v2_7b \
    --model_dir baichuan2-7b \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --output_dir ./tmp/baichuan_v2_7b/trt_engines/int4_weight_only/1-gpu/
```
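For reference, the INT8 engine compared below would presumably be built with the same invocation, changing only the precision flag and the output directory. This is a sketch rather than the poster's exact command; the paths are assumed to mirror the INT4 build:

```bash
# Assumed INT8 counterpart of the INT4 build above: only the
# --weight_only_precision flag and --output_dir are changed.
python build.py --model_version v2_7b \
    --model_dir baichuan2-7b \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir ./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/
```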
Baichuan-int8, Baichuan-int4, and Baichuan2-int8 seem to work well, but Baichuan2-int4 is very slow.
@hljjjmssyh
Hi, can you share the full build/run commands for both the INT8 and INT4 workflows in your environment? That would make it easier for us to reproduce the issue.
Thanks, June
Any update?
On an NVIDIA A100, for the same request, the INT8 model takes ~200 ms but the INT4 model takes 2.4 s.
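A minimal way to reproduce such a latency comparison would be to time a single request against each engine. This is a sketch: it assumes the `run.py` script from TensorRT-LLM's `examples/baichuan` directory, and the exact flag names may differ between releases:

```bash
# Sketch: time one request against each engine. run.py and its flags are
# from TensorRT-LLM's examples/baichuan; flag names may vary by release,
# and the tokenizer/engine paths here are placeholders.
time python run.py --max_output_len 128 \
    --tokenizer_dir baichuan2-7b \
    --engine_dir ./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/

time python run.py --max_output_len 128 \
    --tokenizer_dir baichuan2-7b \
    --engine_dir ./tmp/baichuan_v2_7b/trt_engines/int4_weight_only/1-gpu/
```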