Closed zhangfeiyu5610 closed 1 month ago
Please follow the template to share your reproduction steps.
First, I built the LLaMA-7B engine as follows:

```shell
python convert_checkpoint.py --model_dir llama_7B-hf \
    --output_dir checkpoint_trt/llama_7B-hf \
    --dtype float16

trtllm-build --checkpoint_dir checkpoint_trt/llama_7B-hf \
    --output_dir /data/uclai/trt_models/llama_7B-hf \
    --gemm_plugin float16 \
    --streamingllm enable
```
Then I used the model to generate tokens:

```shell
python3 ../run.py --max_output_len=50 \
    --tokenizer_dir llama_7B-hf \
    --engine_dir=/data/uclai/trt_models/llama_7B-hf \
    --max_attention_window_size=2048 \
    --sink_token_length=4
```
The result is:

```
Input [Text 0]: " Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: len(output_text)=186
"chef in Paris and London before moving to New York in 1850. He was the first chef to be hired by the newly opened Delmonico’s restaurant, where he worked for 10 years. He then opened his"
```
The result is normal, but when I try to generate more tokens with StreamingLLM:

```shell
python3 ../run.py --max_output_len=4096 \
    --tokenizer_dir llama_7B-hf \
    --engine_dir=/data/uclai/trt_models/llama_7B-hf \
    --max_attention_window_size=2048 \
    --sink_token_length=4
```
I get this error:

```
Traceback (most recent call last):
  File "tensorrt_llm/v0.8.0/examples/seq_monkey/../run.py", line 565, in
```
The output sequence length you request when running run.py is larger than the maximum sequence length of the engine (which is 2048 by default).
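To illustrate why the request fails: the runner checks the requested sequence length against the limit baked into the engine at build time. This is my own sketch of that check, not the actual TensorRT-LLM source; the variable names and the `input_len` value are assumptions.

```python
# Sketch of the length check that trips the assertion (not TensorRT-LLM code).
engine_max_seq_len = 2048   # fixed when the engine was built
input_len = 16              # example prompt length (assumed)
max_output_len = 4096       # the value passed to run.py

requested = input_len + max_output_len
print(requested <= engine_max_seq_len)  # False -> AssertionError in run.py
```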
How can I remove this engine length limit? I want to generate sequences as long as possible with StreamingLLM.
You should set the max input length when building the engine.
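Concretely, a rebuild along these lines should raise the limit. The `--max_input_len` and `--max_output_len` flags are my assumption for the v0.8.0 `trtllm-build` CLI; check `trtllm-build --help` on your install for the exact flag names.

```shell
# Rebuild the engine with larger build-time length limits (flag names assumed for v0.8.0)
trtllm-build --checkpoint_dir checkpoint_trt/llama_7B-hf \
    --output_dir /data/uclai/trt_models/llama_7B-hf \
    --gemm_plugin float16 \
    --streamingllm enable \
    --max_input_len 2048 \
    --max_output_len 4096
```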
When I experiment with StreamingLLM on LLaMA following this guide (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-streamingllm), I keep hitting length-related errors. When I run the run.py script, if I set max_seq_len longer than model_config.max_seq_len, the error is:

```
Traceback (most recent call last):
  File "tensorrt_llm/v0.8.0/examples/seq_monkey/../run.py", line 565, in
    main(args)
  File "tensorrt_llm/v0.8.0/examples/seq_monkey/../run.py", line 414, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 168, in from_dir
    assert max_seq_len <= model_config.max_seq_len
AssertionError
```
I don't understand how StreamingLLM works. How can I fix this?
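For what it's worth, the cache policy behind the `--sink_token_length` and `--max_attention_window_size` options is easy to sketch: StreamingLLM keeps the first few "attention sink" tokens plus a sliding window of the most recent tokens, evicting everything in between, so the KV cache stays bounded no matter how long generation runs. This is my own minimal illustration, not TensorRT-LLM code:

```python
def kept_positions(seq_len, sink_token_length=4, max_attention_window_size=2048):
    """Return the token positions whose KV entries a StreamingLLM-style
    cache retains: the first `sink_token_length` sink tokens plus the
    most recent tokens, up to `max_attention_window_size` in total."""
    if seq_len <= max_attention_window_size:
        return list(range(seq_len))
    window = max_attention_window_size - sink_token_length
    sinks = list(range(sink_token_length))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent

# At 5000 generated tokens, only 2048 KV entries are alive:
kept = kept_positions(5000)
print(len(kept))   # 2048
print(kept[:6])    # [0, 1, 2, 3, 2956, 2957]
```

Note that this bounds the KV-cache memory, but the engine's build-time max sequence length is a separate, independent limit, which is why the assertion above still fires.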