NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

LLama 70B w4a16 generate abnormal result #1701

Closed gloritygithub11 closed 1 month ago

gloritygithub11 commented 1 month ago

System Info

tensorrt 10.0.1, tensorrt-llm 0.11.0.dev2024052800, torch-tensorrt 2.3.0a0

A100 40G

Who can help?

@byshiue

Reproduction

build

python ../llama/convert_checkpoint.py --model_dir /mnt/memory/Meta-Llama-3-70B-Instruct --output_dir /mnt/memory/tmp/trt_models/Meta-Llama-3-70B-Instruct/w4a16/1-gpu-tp --dtype float16 --use_weight_only --weight_only_precision int4 --load_model_on_cpu

trtllm-build \
    --checkpoint_dir /mnt/memory/tmp/trt_models/Meta-Llama-3-70B-Instruct/w4a16/1-gpu-tp \
    --output_dir /mnt/memory/tmp/trt_engines/Meta-Llama-3-70B-Instruct/w4a16/1-gpu-tp \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 1024

run

mpirun --allow-run-as-root -n 1 python3 /app/tensorrt-llm-src/examples/run.py --engine_dir /mnt/memory/tmp/trt_engines/Meta-Llama-3-70B-Instruct/w4a16/1-gpu-tp --tokenizer_dir /mnt/memory/Meta-Llama-3-70B-Instruct --max_output_len 1024 --input_text "I want to go travel to the newyork city. Can you give me a plan for 5 days?" 

Expected behavior

The model generates a valid, coherent response.

Actual behavior

The model generates the following abnormal result:

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Input [Text 0]: "<|begin_of_text|>"I want to go travel to the newyork city. Can you give me a plan for 5 days?""
Output [Text 0 Beam 0]: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ [the output continues with backslashes only]

Additional notes

I tried Meta-Llama-3-8B-Instruct with the same build arguments, and the result is correct.

build

python ../llama/convert_checkpoint.py --model_dir /mnt/memory/Meta-Llama-3-8B-Instruct --output_dir /mnt/memory/tmp/trt_models/Meta-Llama-3-8B-Instruct/w4a16/1-gpu-tp --dtype float16 --use_weight_only --weight_only_precision int4 --load_model_on_cpu

trtllm-build \
    --checkpoint_dir /mnt/memory/tmp/trt_models/Meta-Llama-3-8B-Instruct/w4a16/1-gpu-tp \
    --output_dir /mnt/memory/tmp/trt_engines/Meta-Llama-3-8B-Instruct/w4a16/1-gpu-tp \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 1024

result

mpirun --allow-run-as-root -n 1 python3 /app/tensorrt-llm-src/examples/run.py --engine_dir /mnt/memory/tmp/trt_engines/Meta-Llama-3-8B-Instruct/w4a16/1-gpu-tp --tokenizer_dir /mnt/memory/Meta-Llama-3-8B-Instruct --max_output_len 1024 --input_text "I want to go travel to the newyork city. Can you give me a plan for 5 days?"
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Input [Text 0]: ""I want to go travel to the newyork city. Can you give me a plan for 5 days?""
Output [Text 0 Beam 0]: " 

Here's a suggested itinerary for your 5-day trip to New York City:

Day 1: Arrival and Exploring Midtown Manhattan

* Arrive at one of New York City's three major airports (JFK, LGA, or EWR)
* Take a taxi or public transportation to your hotel in Midtown Manhattan
* Check-in to your hotel and freshen up
* Visit Times Square, a bustling area filled with bright lights and giant billboards
* Grab lunch at a classic New York diner like Ellen's Stardilleries or the diner at the Plaza Hotel
* Spend the afternoon exploring the Museum of Modern Art (MoMA) or the American Museum of Natural History
* Enjoy dinner at a classic New York restaurant like Carbone or the Russian Tea Room

Day 2: Central Park and the Upper West Side

* Start the day with a leisurely stroll through Central Park
* Visit the Central Park Zoo and the Alice Henderson Memorial Carousel
* Grab lunch at a café or restaurant near the park
* Spend the afternoon exploring the Upper West Side neighborhood
* Visit the American Museum of Natural History again or explore the nearby Hayden Planetarium
* Enjoy dinner at a classic New York restaurant like Zabar's or the Upper West Side's own Levain Bakery

byshiue commented 1 month ago

Could you try using quantize.py to quantize the model with GPTQ or AWQ? It may be that plain int4 weight-only quantization cannot preserve accuracy for this model.
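
A toy sketch of one reason this can happen, assuming symmetric round-to-nearest int4 and a synthetic weight row with a single outlier (the numbers, the group size of 128, and the helper names are illustrative assumptions, not taken from TensorRT-LLM). With one scale for the whole row, the outlier stretches the scale and most small weights round to zero; with one scale per group, only the group holding the outlier is affected. This shows the group-wise scaling part only, not the activation-aware search that AWQ also performs.

```python
import random


def quantize_int4(values, group_size):
    """Symmetric int4 round-to-nearest, one scale per `group_size` values.

    Returns the dequantized values so the error can be measured directly.
    """
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        for v in group:
            q = max(-8, min(7, round(v / scale)))  # clamp to the int4 range
            out.append(q * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Hypothetical weight row: small Gaussian values plus one large outlier.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
weights[5] = 1.5

# One scale for the whole row (stand-in for a per-channel scale).
per_channel = quantize_int4(weights, group_size=len(weights))
# One scale per 128 weights (stand-in for group-wise quantization).
group_wise = quantize_int4(weights, group_size=128)

err_channel = mean_abs_error(weights, per_channel)
err_group = mean_abs_error(weights, group_wise)
print(f"per-channel mean abs error: {err_channel:.5f}")
print(f"group-wise  mean abs error: {err_group:.5f}")
```

Whether outliers like this are what actually breaks the 70B model here is not confirmed in this thread; the sketch only shows why the choice of scaling granularity can matter.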

gloritygithub11 commented 1 month ago

I tried int4 AWQ, and the result looks good. Thanks.