TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
mpirun --allow-run-as-root -n 1 python3 /app/tensorrt-llm-src/examples/run.py --engine_dir /mnt/memory/tmp/trt_engines/Meta-Llama-3-70B-Instruct/w4a16/1-gpu-tp --tokenizer_dir /mnt/memory/Meta-Llama-3-70B-Instruct --max_output_len 1024 --input_text "I want to go travel to the newyork city. Can you give me a plan for 5 days?"
Expected behavior
generate valid result
actual behavior
generate following abnormal result:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Input [Text 0]: "<|begin_of_text|>"I want to go travel to the newyork city. Can you give me a plan for 5 days?""
Output [Text 0 Beam 0]: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
additional notes
I tried Meta-Llama-3-8B-Instruct with similar build argument, the result shows correctly
mpirun --allow-run-as-root -n 1 python3 /app/tensorrt-llm-src/examples/run.py --engine_dir /mnt/memory/tmp/trt_engines/Meta-Llama-3-8B-Instruct/w4a16/1-gpu-tp --tokenizer_dir /mnt/memory/Meta-Llama-3-8B-Instruct --max_output_len 1024 --input_text "I want to go travel to the newyork city. Can you give me a plan for 5 days?"
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Input [Text 0]: ""I want to go travel to the newyork city. Can you give me a plan for 5 days?""
Output [Text 0 Beam 0]: "
Here's a suggested itinerary for your 5-day trip to New York City:
Day 1: Arrival and Exploring Midtown Manhattan
* Arrive at one of New York City's three major airports (JFK, LGA, or EWR)
* Take a taxi or public transportation to your hotel in Midtown Manhattan
* Check-in to your hotel and freshen up
* Visit Times Square, a bustling area filled with bright lights and giant billboards
* Grab lunch at a classic New York diner like Ellen's Stardilleries or the diner at the Plaza Hotel
* Spend the afternoon exploring the Museum of Modern Art (MoMA) or the American Museum of Natural History
* Enjoy dinner at a classic New York restaurant like Carbone or the Russian Tea Room
Day 2: Central Park and the Upper West Side
* Start the day with a leisurely stroll through Central Park
* Visit the Central Park Zoo and the Alice Henderson Memorial Carousel
* Grab lunch at a café or restaurant near the park
* Spend the afternoon exploring the Upper West Side neighborhood
* Visit the American Museum of Natural History again or explore the nearby Hayden Planetarium
* Enjoy dinner at a classic New York restaurant like Zabar's or the Upper West Side's own Levain Bakery
System Info
ensorrt 10.0.1 tensorrt-llm 0.11.0.dev2024052800 torch-tensorrt 2.3.0a0
A100 40G
Who can help?
@byshiue
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
build
run
Expected behavior
generate valid result
actual behavior
generate following abnormal result:
additional notes
I tried Meta-Llama-3-8B-Instruct with similar build argument, the result shows correctly
build
result