NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Mistral with dtype=bf16 produces garbage for large prompt lengths with TRT-LLM v0.8.0 #1830

Closed: maaquib closed this issue 3 days ago

maaquib commented 4 days ago

Who can help?

@byshiue

Reproduction

$ docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04

apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs
pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com
git clone https://github.com/NVIDIA/TensorRT-LLM.git -b v0.8.0
cd TensorRT-LLM/examples/llama
git clone https://USERNAME:TOKEN@huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
pip install "numpy<2"  # avoid numpy 2.x incompatibility with the 0.8.0 wheels
python3 convert_checkpoint.py --model_dir Mistral-7B-Instruct-v0.2 \
                              --output_dir trt_engines/tllm_checkpoint_1gpu_mistral \
                              --dtype bfloat16
trtllm-build --checkpoint_dir trt_engines/tllm_checkpoint_1gpu_mistral \
             --output_dir trt_engines/bf16/1-gpu/ \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --max_input_len 16384 \
             --max_output_len 1024

## Correct behavior with a shorter prompt
head -600 /tmp/prompt.txt > prompt_tokens.txt
echo "\n. What is Gregor's role?" >> prompt_tokens.txt

python3 ../run.py --max_input_length 16384 \
                  --max_output_len 256 \
                  --input_text "$(cat prompt_tokens.txt)" \
                  --tokenizer_dir Mistral-7B-Instruct-v0.2 \
                  --engine_dir trt_engines/bf16/1-gpu

## Garbage response around prompt len ~8k
cat /tmp/prompt.txt > prompt_tokens.txt
echo "\n. What is Gregor's role?" >> prompt_tokens.txt

python3 ../run.py --max_input_length 16384 \
                  --max_output_len 256 \
                  --input_text "$(cat prompt_tokens.txt)" \
                  --tokenizer_dir Mistral-7B-Instruct-v0.2 \
                  --engine_dir trt_engines/bf16/1-gpu

Expected behavior

Response is valid text

Output [Text 0 Beam 0]: "

Gregor is the son of the old friend of Werle, who is a merchant in Bergen. He has been studying at the university in Copenhagen. He has been engaged in photography for some time. He has been in love with a girl, Gina Hansen, who was the housekeeper in the merchant’s house in Bergen. She was married to a man, Hjalmar Ekdal, who was a clerk in the merchant’s office. Hjalmar Ekdal was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He was a man of a very peculiar character. He"

Actual behavior

Response is gibberish

Output [Text 0 Beam 0]: "Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question Question"

Additional notes

prompt.txt

hijkzzz commented 4 days ago

Please try the latest version of TRT-LLM. See the tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html

maaquib commented 3 days ago

@hijkzzz Tried with the latest version:

$ docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.4.0-devel-ubuntu22.04

apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama
git clone https://USERNAME:TOKEN@huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
python3 convert_checkpoint.py --model_dir Mistral-7B-Instruct-v0.2 \
                              --output_dir trt_engines/tllm_checkpoint_1gpu_mistral \
                              --dtype bfloat16
trtllm-build --checkpoint_dir trt_engines/tllm_checkpoint_1gpu_mistral \
             --output_dir trt_engines/bf16/1-gpu/ \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --max_input_len 16384 \
             --max_output_len 1024

## Correct behavior with a shorter prompt
head -600 /tmp/prompt.txt > prompt_tokens.txt
echo "\n. What is Gregor's role?" >> prompt_tokens.txt

python3 ../run.py --max_input_length 16384 \
                  --max_output_len 256 \
                  --input_text "$(cat prompt_tokens.txt)" \
                  --tokenizer_dir Mistral-7B-Instruct-v0.2 \
                  --engine_dir trt_engines/bf16/1-gpu

## Seems like the maximum input length is being overridden to 8192
cat /tmp/prompt.txt > prompt_tokens.txt
echo "\n. What is Gregor's role?" >> prompt_tokens.txt

python3 ../run.py --max_input_length 16384 \
                  --max_output_len 256 \
                  --input_text "$(cat prompt_tokens.txt)" \
                  --tokenizer_dir Mistral-7B-Instruct-v0.2 \
                  --engine_dir trt_engines/bf16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024062500
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[06/25/2024-23:30:55] [TRT-LLM] [I] Load engine takes: 7.980865001678467 sec
[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Prompt length (15072) exceeds maximum input length (8192). (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:227)
1       0x7f85323ee1a4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7591a4) [0x7f85323ee1a4]
2       0x7f85340f4567 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455
3       0x7f86c6eb0253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f86c6eb0253]
4       0x7f8741643ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8741643ac3]
5       0x7f87416d4a04 clone + 68
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/llama/../run.py", line 503, in <module>
    main(args)
  File "/TensorRT-LLM/examples/llama/../run.py", line 343, in main
    outputs = runner.generate(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 466, in generate
    return self._initialize_and_fill_output(request_ids, end_id,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 520, in _initialize_and_fill_output
    return self._fill_output(responses, output_ids, end_id, return_dict,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 556, in _fill_output
    raise RuntimeError(response.error_msg)
RuntimeError: Encountered an error when fetching new request: Prompt length (15072) exceeds maximum input length (8192). (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:227)
1       0x7f85323ee1a4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7591a4) [0x7f85323ee1a4]
2       0x7f85340f4567 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455
3       0x7f86c6eb0253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f86c6eb0253]
4       0x7f8741643ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8741643ac3]
5       0x7f87416d4a04 clone + 68
hijkzzz commented 3 days ago

It seems that you did not set the maximum input length correctly:

RuntimeError: Encountered an error when fetching new request: Prompt length (15072) exceeds maximum input length (8192). (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:227)

maaquib commented 3 days ago

I noticed some of the args have been deprecated in favour of newer ones in the latest version. I finally got it working with the following:

trtllm-build --checkpoint_dir trt_engines/tllm_checkpoint_1gpu_mistral \
             --output_dir trt_engines/bf16/1-gpu/ \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --max_seq_len 17408 \
             --max_input_len 16384 \
             --max_num_tokens 16384