NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How does streamingllm support unlimited input and output? #1474

Closed: zhangfeiyu5610 closed this 1 month ago

zhangfeiyu5610 commented 5 months ago

When I experiment with StreamingLLM on LLaMA following this guide (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-streamingllm), I always get length-related errors. When I run the run.py script, if I set max_seq_len longer than model_config.max_seq_len, the error is:

```
Traceback (most recent call last):
  File "tensorrt_llm/v0.8.0/examples/seq_monkey/../run.py", line 565, in <module>
    main(args)
  File "tensorrt_llm/v0.8.0/examples/seq_monkey/../run.py", line 414, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 168, in from_dir
    assert max_seq_len <= model_config.max_seq_len
AssertionError
```

I don't know how StreamingLLM works. How can I fix this?

byshiue commented 5 months ago

Please follow the issue template to share your reproduction steps.

zhangfeiyu5610 commented 5 months ago

First, I build the LLaMA-7B model as follows:

```bash
python convert_checkpoint.py --model_dir llama_7B-hf \
                             --output_dir checkpoint_trt/llama_7B-hf \
                             --dtype float16

trtllm-build --checkpoint_dir checkpoint_trt/llama_7B-hf \
             --output_dir /data/uclai/trt_models/llama_7B-hf \
             --gemm_plugin float16 \
             --streamingllm enable
```

Then I try to generate tokens with the model:

```bash
python3 ../run.py --max_output_len=50 \
                  --tokenizer_dir llama_7B-hf \
                  --engine_dir=/data/uclai/trt_models/llama_7B-hf \
                  --max_attention_window_size=2048 \
                  --sink_token_length=4
```

The result:

```
Input [Text 0]: " Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: len(output_text)=186 "chef in Paris and London before moving to New York in 1850. He was the first chef to be hired by the newly opened Delmonico's restaurant, where he worked for 10 years. He then opened his"
```

The result is normal, but when I try to generate more tokens with StreamingLLM:

```bash
python3 ../run.py --max_output_len=4096 \
                  --tokenizer_dir llama_7B-hf \
                  --engine_dir=/data/uclai/trt_models/llama_7B-hf \
                  --max_attention_window_size=2048 \
                  --sink_token_length=4
```

the error is:

```
Traceback (most recent call last):
  File "tensorrt_llm/v0.8.0/examples/seq_monkey/../run.py", line 565, in <module>
    main(args)
  File "tensorrt_llm/v0.8.0/examples/seq_monkey/../run.py", line 414, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 168, in from_dir
    assert max_seq_len <= model_config.max_seq_len
AssertionError
```

I want to know how to set up StreamingLLM to support generating unlimited lengths.

byshiue commented 5 months ago

The output sequence length you request when running run.py is larger than the engine's maximum sequence length (2048 by default).
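A minimal sketch of the failing check, paraphrased from the traceback above (the concrete token counts are illustrative, not from the repro):

```python
# Paraphrased from the assertion in model_runner_cpp.py (v0.8.0): the runtime
# sequence budget is roughly the prompt length plus the requested output
# length, and it must fit inside the engine built by trtllm-build.
input_len = 20              # tokens in the example prompt (illustrative)
max_output_len = 4096       # the value passed to run.py
engine_max_seq_len = 2048   # engine default when no larger limit is set at build time

max_seq_len = input_len + max_output_len
assert max_seq_len <= engine_max_seq_len  # raises AssertionError, as reported
```

With `--max_output_len=50` the sum stays under 2048 and the runner starts; with `--max_output_len=4096` it cannot, regardless of the attention window.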

zhangfeiyu5610 commented 5 months ago

How can I remove this engine length limit? I want to generate output as long as possible through StreamingLLM.

byshiue commented 4 months ago

You should set the max input length when building the engine.
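For example, rebuilding with larger build-time limits lifts the assertion (flag names as in the v0.8.0 trtllm-build CLI; the concrete values are illustrative):

```bash
# Rebuild the engine with a larger sequence budget so that
# prompt length + 4096 output tokens fits within the engine limit.
trtllm-build --checkpoint_dir checkpoint_trt/llama_7B-hf \
             --output_dir /data/uclai/trt_models/llama_7B-hf \
             --gemm_plugin float16 \
             --streamingllm enable \
             --max_input_len 2048 \
             --max_output_len 4096
```

Note that `--max_attention_window_size` and `--sink_token_length` only bound the KV-cache footprint at runtime; the engine's build-time maximum still caps how long a single generation can be, so output is not literally unlimited with a fixed engine.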