maaquib closed this issue 4 months ago
Please try the latest version of TRT-LLM. See the tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html
@hijkzzz Tried with the latest version:
$ docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.4.0-devel-ubuntu22.04
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
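A quick sanity check that the nightly wheel is the one actually being imported (optional; the version string should match the one in the log further down):

# Optional sanity check on the installed wheel.
import tensorrt_llm
print(tensorrt_llm.__version__)  # e.g. 0.11.0.dev2024062500 per the log below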
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama
git clone https://USERNAME:TOKEN@huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
python3 convert_checkpoint.py --model_dir Mistral-7B-Instruct-v0.2 \
--output_dir trt_engines/tllm_checkpoint_1gpu_mistral \
--dtype bfloat16
trtllm-build --checkpoint_dir trt_engines/tllm_checkpoint_1gpu_mistral \
--output_dir trt_engines/bf16/1-gpu/ \
--gemm_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--max_input_len 16384 \
--max_output_len 1024
## Correct behavior with a shorter prompt
head -600 /tmp/prompt.txt > prompt_tokens.txt
echo "\n. What is Gregor's role?" >> prompt_tokens.txt
python3 ../run.py --max_input_length 16384 \
--max_output_len 256 \
--input_text "$(cat prompt_tokens.txt)" \
--tokenizer_dir Mistral-7B-Instruct-v0.2 \
--engine_dir trt_engines/bf16/1-gpu
## Seems like the maximum input length is being overridden to 8192
cat /tmp/prompt.txt > prompt_tokens.txt
echo "\n. What is Gregor's role?" >> prompt_tokens.txt
python3 ../run.py --max_input_length 16384 \
--max_output_len 256 \
--input_text "$(cat prompt_tokens.txt)" \
--tokenizer_dir Mistral-7B-Instruct-v0.2 \
--engine_dir trt_engines/bf16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024062500
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[06/25/2024-23:30:55] [TRT-LLM] [I] Load engine takes: 7.980865001678467 sec
[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Prompt length (15072) exceeds maximum input length (8192). (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:227)
1 0x7f85323ee1a4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7591a4) [0x7f85323ee1a4]
2 0x7f85340f4567 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455
3 0x7f86c6eb0253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f86c6eb0253]
4 0x7f8741643ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8741643ac3]
5 0x7f87416d4a04 clone + 68
Traceback (most recent call last):
File "/TensorRT-LLM/examples/llama/../run.py", line 503, in <module>
main(args)
File "/TensorRT-LLM/examples/llama/../run.py", line 343, in main
outputs = runner.generate(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 466, in generate
return self._initialize_and_fill_output(request_ids, end_id,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 520, in _initialize_and_fill_output
return self._fill_output(responses, output_ids, end_id, return_dict,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 556, in _fill_output
raise RuntimeError(response.error_msg)
RuntimeError: Encountered an error when fetching new request: Prompt length (15072) exceeds maximum input length (8192). (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:227)
1 0x7f85323ee1a4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7591a4) [0x7f85323ee1a4]
2 0x7f85340f4567 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455
3 0x7f86c6eb0253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f86c6eb0253]
4 0x7f8741643ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8741643ac3]
5 0x7f87416d4a04 clone + 68
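For reference, the prompt length of 15072 reported in the error can be reproduced by tokenizing the file directly. A minimal sketch, assuming the transformers package is installed and the model clone from the steps above is in the working directory:

# Count prompt tokens with the same tokenizer the model uses.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Mistral-7B-Instruct-v0.2")
with open("prompt_tokens.txt") as f:
    prompt = f.read()
print(len(tokenizer.encode(prompt)))  # roughly 15072 for the full prompt.txt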
It seems that you did not set the maximum input length correctly.
RuntimeError: Encountered an error when fetching new request: Prompt length (15072) exceeds maximum input length (8192). (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:227)
I noticed some of the args have been deprecated in favour of newer ones in the latest version. I finally got it working with the following:
trtllm-build --checkpoint_dir trt_engines/tllm_checkpoint_1gpu_mistral \
--output_dir trt_engines/bf16/1-gpu/ \
--gemm_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--max_seq_len 17408 \
--max_input_len 16384 \
--max_num_tokens 16384
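Note that --max_seq_len bounds input plus generated tokens, so 17408 = 16384 + 1024 matches the limits from the earlier build. To double-check what an engine was actually built with, the build config written next to the engine can be inspected. A minimal sketch; the exact key layout is an assumption based on recent TensorRT-LLM versions:

# Sketch: read the build config that trtllm-build writes into the engine dir.
# Key names ("build_config", "max_input_len", "max_seq_len") are assumed from
# recent releases and may differ across versions.
import json

with open("trt_engines/bf16/1-gpu/config.json") as f:
    cfg = json.load(f)

build = cfg.get("build_config", cfg)  # fall back if the layout differs
print(build.get("max_input_len"))  # expect 16384
print(build.get("max_seq_len"))    # expect 17408 (16384 input + 1024 output)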
Who can help?
@byshiue

Expected behavior
Response is valid text

Actual behavior
Response is gibberish

Additional notes
prompt.txt (attachment)