NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Get repeated answers of no more than three words while deploying LLaMA3-Instruct-8B with Triton server (repeated-generation issue when deploying the LLaMA3 model with Triton) #1713

Closed AndyZZt closed 1 month ago

AndyZZt commented 1 month ago

Environment

- CPU architecture: x86_64
- CPU/Host memory size: 32 GB
- GPU properties: SM86
- GPU name: NVIDIA A10
- GPU memory size: 24 GB
- Clock frequencies used: 1695 MHz
- Libraries
  - TensorRT-LLM: v0.9.0
  - TensorRT: 9.3.0.post12.dev1 (`dpkg -l | grep nvinfer` reports 8.6.3)
  - CUDA: 12.3
- Container used: 24.04-trtllm-python-py3
- NVIDIA driver version: 535.161.08
- OS: Ubuntu 22.04

Reproduction Steps

docker exec -it trtllm1 /bin/bash
mamba deactivate
mamba deactivate

# git from correct branch
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git  
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git  

# build trt engines
cd TensorRT-LLM
trtllm-build --checkpoint_dir ../Work/TensorRT-LLM/examples/llama/tllm_checkpoint_1gpu_tp1 \
            --output_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
            --remove_input_padding enable \
            --gpt_attention_plugin float16 --gemm_plugin float16 \
            --context_fmha enable --paged_kv_cache enable \
            --streamingllm enable \
            --use_paged_context_fmha enable --enable_chunked_context \
            --use_context_fmha_for_generation enable \
            --max_input_len 512 --max_output_len 512 \
            --max_batch_size 64

# copy rank0.engine & config.json
cd ../tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/fp16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/

# model configuration
export HF_LLAMA_MODEL=/path/to/llama3-8B-Instruct-hf
export ENGINE_PATH=/path/to/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,repetition_penalty:0.9,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,decoding_mode:top_p,enable_chunked_context:True,batch_scheduler_policy:max_utilization,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:v1,enable_trt_overlap:True,max_queue_delay_microseconds:0
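The fill_template.py calls above substitute `${…}` placeholders in each config.pbtxt with the comma-separated `key:value` pairs passed on the command line. A minimal sketch of that substitution (a hypothetical helper for illustration, not the script's actual implementation):

```python
from string import Template

def fill_template(pbtxt_text: str, assignments: str) -> str:
    """Replace ${key} placeholders with values from a 'key:value,key:value' string.

    Toy parser: splits on ',' and the first ':', so it cannot handle
    values that themselves contain commas.
    """
    values = dict(pair.split(":", 1) for pair in assignments.split(","))
    # safe_substitute leaves any placeholder without a value untouched
    return Template(pbtxt_text).safe_substitute(values)

filled = fill_template(
    "max_batch_size: ${triton_max_batch_size}",
    "triton_max_batch_size:64",
)
```

With this model in mind, each command above is just "fill these placeholders in this file in place" (`-i`).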

# launch triton-server
python3 scripts/launch_triton_server.py --model_repo=all_models/inflight_batcher_llm --world_size 1

# send request via curl
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "what are flowers","max_tokens": 100,"bad_words":[""],"stop_words":["<|eot_id|>"]}'
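The same request can be built programmatically; a small sketch that assembles the JSON body for Triton's `/v2/models/<name>/generate` route (the helper name is hypothetical; send the body with any HTTP client once the server is up):

```python
import json

def build_generate_request(text, max_tokens=100, stop_words=("<|eot_id|>",)):
    """Return (url, body) matching the curl call above."""
    body = {
        "text_input": text,
        "max_tokens": max_tokens,
        "bad_words": [""],
        "stop_words": list(stop_words),
    }
    return "http://localhost:8000/v2/models/ensemble/generate", json.dumps(body)

url, body = build_generate_request("what are flowers")
# e.g. requests.post(url, data=body) when the server is running
```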

Expected Behavior

normal answer

Actual Behavior

(screenshot: the server returns repeated answers of no more than three words)

Additional Notes When I simply run TensorRT-LLM locally, via

python3 ../run.py --tokenizer_dir ./tmp/llama/8B/ \
                  --engine_dir=./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
                  --input_text "How to build tensorrt engine?" \
                  --max_output_len 100

The model can answer normally.

hijkzzz commented 1 month ago

Can you try turning off in-flight batching? This may be a known bug in in-flight batching that will be fixed in the latest commit.

AndyZZt commented 1 month ago

> Can you try turning off in-flight batching? This may be a known bug in in-flight batching that will be fixed in the latest commit.

I already turned off in-flight batching by setting "batching_strategy:v1" in tensorrt_llm/config.pbtxt. Or do you mean changing the parameters while building the TensorRT engine? By the way, I saw someone else get normal answers using almost the same command lines as mine, except he used TensorRT-LLM & tensorrtllm_backend version 0.8.0 while mine is 0.9.0.

hijkzzz commented 1 month ago

Thanks. Please try disabling these options. I think one of them may be causing the problem:

--context_fmha enable --paged_kv_cache enable \
            --streamingllm enable \
            --use_paged_context_fmha enable --enable_chunked_context \
            --use_context_fmha_for_generation enable \

In addition, I recommend using the main branch to compile TRT-LLM because it contains the latest bug fixes.

AndyZZt commented 1 month ago

> Thanks. Please try disabling these options. I think one of them may be causing the problem, […] I recommend using the main branch to compile TRT-LLM because it contains the latest bug fixes

Thank you for your guidance! After disabling all the options you mentioned, I can use it normally in v1 mode. However, I still have two questions:

1. Which specific parameter caused the bug, and has it been reported?
2. Has anyone else encountered this particular bug? If not, what should I do in version 0.9.0 to enable in-flight mode? I ask because I ran into a lot of version-compatibility problems when deploying the latest version.

Many thanks.

hijkzzz commented 1 month ago

> Thank you for your guidance! After disabling all the options you mentioned, I can use it normally in v1 mode. […]

We will update the main branch code soon (today?), including the Triton server, so you can try the new version then; please give feedback if you still see the same issue. For the root cause, try turning these options off one by one to find the culprit while minimizing the performance regression. The older v0.9 release of TRT-LLM is not recommended; the main branch always contains the latest bug fixes.

AndyZZt commented 1 month ago

> We will update the main branch code soon (today?), including the triton server so that you can try the new version at that time […]

Thanks, I'll try the main branch. I turned the parameters back on one by one and finally found that the culprit was the "streamingllm" parameter, which supports long context by always keeping the first S tokens in the attention window. After turning it off, in-flight batching worked. However, when I turn streamingllm on and send the parameters it requires (max_attention_window_size & sink_attention_size) in the HTTP POST, Triton gives back repeated answers again. So that is another story, about the streamingllm feature :) Thanks again for the help. I'll close this issue.
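For context on the "first S tokens" behavior mentioned above: StreamingLLM-style attention keeps a few initial "sink" tokens plus a sliding window of the most recent tokens. A toy sketch of which positions a new token attends to (illustrative only, not TensorRT-LLM's implementation):

```python
def streamingllm_window(seq_len, sink_size, window_size):
    """Indices a new token attends to: the first `sink_size` tokens
    (attention sinks) plus the most recent `window_size` tokens."""
    sinks = list(range(min(sink_size, seq_len)))
    recent = list(range(max(sink_size, seq_len - window_size), seq_len))
    return sinks + recent

# At position 10 with 2 sink tokens and a window of 4,
# the model attends to tokens 0-1 and 6-9 only.
visible = streamingllm_window(10, 2, 4)
```

Tokens between the sinks and the window are evicted from the KV cache, which is how the feature bounds memory for long contexts.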

AndyZZt commented 1 month ago

Issues about streamingllm feature.