vip-china opened this issue 1 month ago
Do you encounter the same issue on LLaMA 2-70B?
The current test is on Llama-3; Llama-2-70B has not been tested here before. Could this be related to int4/AWQ? FP16 is normal.
Because we don't observe this issue on Llama-2-70B with int4-AWQ, and we don't have a Llama-3-70B checkpoint right now, we hope to find a baseline model as a reference to help reproduce it. Could you give Llama-2-70B a try?
I can confirm I'm seeing the same behavior as @vip-china with the llama-3-70B checkpoint.
I think that makes sense. Here is my script with the vanilla Llama3-70B.
I already mentioned this in issue #1470 and in my comment there.
Quantization for Llama3 is a bit different. Since the model was trained on a huge corpus (>15T tokens), it seems RTN no longer works well.
I found a strange phenomenon where RTN-int8 yields worse output than AWQ (W4A16); FP8 also performs better than RTN-int8.
The llama.cpp community raised the same problem when quantizing the Llama3 model about 4 days ago.
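To make the RTN point concrete, here is a minimal numpy sketch (my own illustration, not TRT-LLM code) of per-channel round-to-nearest quantization: a single weight outlier stretches the scale and crushes the small weights, and the effect is much worse at 4 bits than at 8.

```python
import numpy as np

def rtn_quantize(w, n_bits):
    # Per-output-channel symmetric round-to-nearest: one scale per row.
    qmax = 2 ** (n_bits - 1) - 1                        # 7 for int4, 127 for int8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                    # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(8, 4096)).astype(np.float32)
w[0, 0] = 1.0                                           # one outlier in row 0
for bits in (8, 4):
    err = np.abs(w - rtn_quantize(w, bits)).mean()
    print(f"int{bits} RTN mean abs error: {err:.6f}")
```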
FYI, here is my script with the vanilla model. The result is public on my Hugging Face as well: Huggingface link.
TRT-LLM 0.11 main branch (f430a4b447ef4cba22698902d43eae0debf08594), DGX-H100.
AWQ script.
python ../quantization/quantize.py --model_dir /root/.cache/huggingface/hub/models--casperhansen--llama-3-70b-fp16/snapshots/c8647dcc2296eb8d763645645ebda784da16141a \
--dtype float16 \
--qformat int4_awq \
--awq_block_size 64 \
--output_dir ./quantized-llama3-70b-awq-w4a16-gs64 \
--batch_size 32 \
--tp_size 4 \
--calib_size 512
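As a quick sanity check before building (a hedged sketch; it only assumes quantize.py leaves a config.json in the output dir, which it does in the versions I've used):

```python
import json

# Inspect the quantization fields of the exported TRT-LLM checkpoint.
with open("./quantized-llama3-70b-awq-w4a16-gs64/config.json") as f:
    cfg = json.load(f)
print(cfg.get("quantization"))  # expect W4A16_AWQ with group_size 64
```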
trtllm-build script.
trtllm-build --checkpoint_dir ./quantized-llama3-70b-awq-w4a16-gs64 \
--output_dir ./llama3-70b-awq-bs128 \
--gpt_attention_plugin float16 \
--max_batch_size 32 \
--max_input_len 4096 \
--max_output_len 4096 \
--context_fmha enable \
--paged_kv_cache enable \
--remove_input_padding enable \
--multi_block_mode enable \
--use_paged_context_fmha enable \
--tokens_per_block 64 \
--workers 4 \
--gemm_plugin auto
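Unrelated to the garbling, but for anyone reusing this build line, a rough back-of-envelope for the paged KV cache at these limits (assuming the usual Llama-3-70B shape: 80 layers, 8 KV heads via GQA, head_dim 128, FP16 KV):

```python
# Rough KV-cache sizing for max_input_len + max_output_len = 8192 tokens.
layers, kv_heads, head_dim = 80, 8, 128   # assumed Llama-3-70B GQA shape
bytes_per_elt = 2                         # fp16 K/V entries
tokens = 4096 + 4096
per_seq = 2 * layers * kv_heads * head_dim * bytes_per_elt * tokens  # K and V
print(f"~{per_seq / 2**30:.1f} GiB per sequence, "
      f"~{32 * per_seq / 2**30:.0f} GiB at max_batch_size 32 (sharded over TP=4)")
```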
run.py
mpirun -n 4 --allow-run-as-root --oversubscribe python3 ../run.py --engine_dir ./llama3-70b-awq-bs128 --tokenizer_dir /code/tensorrt_llm/models--casperhansen--llama-3-70b-fp16/snapshots/c8647dcc2296eb8d763645645ebda784da16141a --max_output_len 20 --input_text "I lovefrench quiche"
Output.
Input [Text 0]: "I lovefrench quiche"
Output [Text 0 Beam 0]: " and this one looks so delicious. I love the addition of the spinach and the cheese. I am"
System Info
GPU name: NVIDIA A6000
TensorRT-LLM version: v0.9.0 (main)
transformers version: 0.41.0
Who can help?
@nc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Quantization:
python3 convert_checkpoint.py \
    --model_dir ./dolphin-2.9-llama3-70b \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --tp_size 4 \
    --pp_size 1
1.1 Or:
python ../quantization/quantize.py \
    --model_dir ./dolphin-2.9-llama3-70b \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --calib_size 32 \
    --tp_size 4
2. Build:
trtllm-build --checkpoint_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-by \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --use_custom_all_reduce disable \
    --max_input_len 8192 \
    --max_output_len 4096
3. Inference:
python3 run.py --engine_dir ./llama/dolphin-2.9-llama3-70b-new-ljf-int4-by \
    --tokenizer_dir /tensorrtllm_backend/TensorRT-LLM/examples/llama/dolphin-2.9-llama3-70b \
    --max_output_len 20 \
    --input_text "I lovefrench quiche"
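To probe the RTN outlier hypothesis from the comments above without rebuilding anything, one could look at per-channel weight ranges straight from the HF shards. This is only a sketch: the shard filename and tensor name below are assumptions based on the standard HF Llama layout.

```python
from safetensors import safe_open

# Shard filename and tensor name are assumed (standard HF Llama layout).
with safe_open("model-00001-of-00030.safetensors", framework="pt") as f:
    w = f.get_tensor("model.layers.0.mlp.down_proj.weight").float()

per_channel = w.abs().max(dim=1).values
# A large max/median ratio means a few outliers dominate the RTN scale.
print(f"max |w| = {per_channel.max():.4f}, "
      f"median per-channel max = {per_channel.median():.4f}")
```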
Expected behavior
Expect a correct answer.
actual behavior
If quantized to 4 bits, the answer becomes garbled.
![WeCom screenshot_17161209198428](https://github.com/NVIDIA/TensorRT-LLM/assets/119389127/373ab196-0689-487f-af20-5570807fb51a)
With FP16 precision the answer is not garbled, but generation does not stop.
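The "unable to stop" part is often an EOS mismatch rather than a quantization problem: Dolphin is a ChatML finetune, so generation should stop on <|im_end|>, while the base Llama3 tokenizer's eos_token may be something else. A hedged check (model path as in the repro above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./dolphin-2.9-llama3-70b")
im_end = tok.convert_tokens_to_ids("<|im_end|>")
print("eos_token:", tok.eos_token, "| <|im_end|> id:", im_end)
# If eos_token_id differs from im_end, pass im_end as the end id at runtime,
# otherwise the model generates past the ChatML turn boundary.
```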
additional notes
prompt:
<|im_start|>system
You are gpt, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
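For reference, the exact string that template produces for a given user message (plain Python, nothing TRT-LLM specific; the system message is the one above):

```python
def build_prompt(user_msg: str) -> str:
    # ChatML template from the additional notes; user_msg is the variable part.
    return (
        "<|im_start|>system\n"
        "You are gpt, a helpful AI assistant.<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_prompt("I lovefrench quiche"))
```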