NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Bug Report] Llama 3 70B int4 inference abnormal #1638

Open vip-china opened 1 month ago

vip-china commented 1 month ago

System Info

GPU name: NVIDIA A6000
TensorRT-LLM version: v0.9.0 (main)
transformers version: 0.41.0

Who can help?

@nc

Information

Tasks

Reproduction

1. Quantization

python3 convert_checkpoint.py --model_dir ./dolphin-2.9-llama3-70b \
                              --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4 \
                              --tp_size 4 \
                              --pp_size 1

1.1 or, with AWQ:

python ../quantization/quantize.py --model_dir ./dolphin-2.9-llama3-70b \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
                                   --calib_size 32 \
                                   --tp_size 4

2. Build

trtllm-build --checkpoint_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
             --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-by \
             --gemm_plugin float16 \
             --max_batch_size 8 \
             --use_custom_all_reduce disable \
             --max_input_len 8192 \
             --max_output_len 4096

3. Inference

python3 run.py --engine_dir ./llama/dolphin-2.9-llama3-70b-new-ljf-int4-by \
               --tokenizer_dir /tensorrtllm_backend/TensorRT-LLM/examples/llama/dolphin-2.9-llama3-70b \
               --max_output_len 20 \
               --input_text "I lovefrench quiche"
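For a quick cross-check outside TensorRT-LLM, the unquantized HF checkpoint can be run on the same input with transformers to see what the baseline answer looks like. This is only a rough sketch, not part of the repro above; the model path and generation settings are placeholders.

# Rough sanity check of the unquantized HF checkpoint on the same prompt
# (model path and generation settings are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./dolphin-2.9-llama3-70b"  # same HF checkpoint used for conversion
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir,
                                             torch_dtype=torch.float16,
                                             device_map="auto")

inputs = tokenizer("I lovefrench quiche", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))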

Expected behavior

A correct answer is expected.

actual behavior

When quantized to 4 bits, the answer becomes garbled (see the attached screenshots).

With FP16 precision the answer is not garbled, but generation is unable to stop.
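One possible contributor to the non-stopping FP16 output (an assumption on my side, not verified here): Dolphin is a ChatML-tuned model, so its turn terminator is <|im_end|> rather than the base Llama 3 end-of-text token, and the runtime needs to be told to stop on that id. A quick way to look up the candidate ids with the HF tokenizer (the path is a placeholder):

# Sketch: look up the ids of the likely stop tokens so the runtime can be
# configured to stop on them (tokenizer path is a placeholder).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./dolphin-2.9-llama3-70b")
for tok in ("<|im_end|>", "<|eot_id|>", "<|end_of_text|>"):
    print(tok, tokenizer.convert_tokens_to_ids(tok))
print("configured eos token:", tokenizer.eos_token, tokenizer.eos_token_id)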

additional notes

prompt:

<|im_start|>system
You are gpt, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
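Note that the run.py command in the reproduction above passes the raw --input_text without this template. A minimal sketch (plain Python; the function name and default system message are just placeholders) of building the full ChatML string to pass as --input_text:

# Wrap the user message in the ChatML template before passing it to --input_text
# (function name and default system message are placeholders).
def build_chatml_prompt(user_message: str,
                        system_message: str = "You are gpt, a helpful AI assistant.") -> str:
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("I lovefrench quiche"))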

byshiue commented 1 month ago

Do you encounter same issue on LLaMA 2-70B?

vip-china commented 1 month ago

> Do you encounter same issue on LLaMA 2-70B?

The current test is Llama 3; LLaMA-2-70B has not been tested before. Is this related to int4/AWQ? FP16 is normal.

byshiue commented 1 month ago

Because we don't observe such an issue on LLaMA-2-70B with int4-AWQ, and we don't have a LLaMA-3-70B checkpoint right now, we hope to find a baseline model as a reference to help reproduce this. Could you give LLaMA-2-70B a try?

smehta2000 commented 1 month ago

I can confirm I'm seeing the same behavior as @vip-china with the llama-3-70B checkpoint.

matichon-vultureprime commented 1 month ago

I think it makes sense. This is my script with Vanilla Llama3-70b.

I already mentioned this in issue #1470 and in my comment there.

Quantization for Llama3 is a bit different. Since the model was trained on a huge number of tokens (>15T), it seems like RTN doesn't work well anymore.

So, I found a strange phenomenon where RTN-int8 yields worse output than AWQ (W4A16); FP8 also performs better than RTN-int8.
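To make that concrete, here is a small, self-contained numpy toy (my own illustration, not TensorRT-LLM code) of plain per-group round-to-nearest int4 quantization: a single outlier in a group stretches the group's scale and inflates the error for every other weight in that group, which is exactly the effect AWQ's activation-aware scaling tries to compensate for.

# Toy per-group round-to-nearest (RTN) int4 quantization.
# A single outlier weight stretches the group's scale, so every other weight
# in that group loses precision.
import numpy as np

def rtn_int4(w, group_size=128):
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # symmetric int4 range [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.01, 1024).astype(np.float32)
print("mean abs error, no outlier:  ", np.abs(rtn_int4(w) - w).mean())

w_outlier = w.copy()
w_outlier[5] = 0.5  # one outlier in the first group
print("mean abs error, with outlier:", np.abs(rtn_int4(w_outlier) - w_outlier).mean())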

The llama.cpp community also raised the same problem with quantizing the Llama3 model around 4 days ago.

FYI, here is my script with the vanilla model; it is public on my Hugging Face as well (Hugging Face link).

TRT-LLM 0.11 main branch, commit f430a4b447ef4cba22698902d43eae0debf08594, on a DGX-H100.

AWQ script.

python  ../quantization/quantize.py --model_dir /root/.cache/huggingface/hub/models--casperhansen--llama-3-70b-fp16/snapshots/c8647dcc2296eb8d763645645ebda784da16141a \
                                         --dtype float16 \
                                         --qformat int4_awq \
                                         --awq_block_size 64 \
                                         --output_dir ./quantized-llama3-70b-awq-w4a16-gs64 \
                                         --batch_size 32 \
                                         --tp_size 4 \
                                         --calib_size 512

trtllm-build script.

trtllm-build --checkpoint_dir ./quantized-llama3-70b-awq-w4a16-gs64 \
             --output_dir ./llama3-70b-awq-bs128 \
             --gpt_attention_plugin float16 \
             --max_batch_size 32 \
             --max_input_len 4096 \
             --max_output_len 4096 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode enable \
             --use_paged_context_fmha enable \
             --tokens_per_block 64 \
             --workers 4 \
             --gemm_plugin auto

run.py

mpirun -n 4 --allow-run-as-root --oversubscribe python3 ../run.py \
       --engine_dir ./llama3-70b-awq-bs128 \
       --tokenizer_dir /code/tensorrt_llm/models--casperhansen--llama-3-70b-fp16/snapshots/c8647dcc2296eb8d763645645ebda784da16141a \
       --max_output_len 20 \
       --input_text "I lovefrench quiche"

Output.

Input [Text 0]: "I lovefrench quiche"
Output [Text 0 Beam 0]: " and this one looks so delicious. I love the addition of the spinach and the cheese. I am"