jxchenus closed this issue 4 days ago.
@jxchenus Thanks for your interest in TrtLLM. Can you try to use KV cache reuse as described in the doc: https://nvidia.github.io/TensorRT-LLM/advanced/kv-cache-reuse.html
@hello-11 Thank you for sharing the doc link. Which part of the doc do you think I missed?
@jxchenus If you are running a Triton server, you can enable kv cache reuse with a parameter: parameters: { key: "enable_kv_cache_reuse" value: { string_value: "true" } }
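For reference, a minimal sketch of where that parameter sits in the tensorrt_llm model's config.pbtxt; the name and backend fields below are illustrative placeholders, not taken from this issue:

name: "tensorrt_llm"
backend: "tensorrtllm"

parameters: {
  key: "enable_kv_cache_reuse"
  value: { string_value: "true" }
}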
Thanks @hello-11 !
Here's how I enabled kv cache reuse:
self.runner = runner_cls.from_dir(
    engine_dir=engine_dir,
    rank=self.rank,
    kv_cache_enable_block_reuse=True,
)
Here runner_cls is ModelRunnerCpp. I tried removing the line kv_cache_enable_block_reuse=True, and got the same error as reported above.
Let me try to use the parameter "enable_kv_cache_reuse" and see what I get. Thank you again!
I have tried adding parameters: { key: "enable_kv_cache_reuse" value: { string_value: "true" } } to config.pbtxt for my model, and passing it on to the sampling parameters. As long as TP>1, it produces the same error.
@jxchenus If I understood correctly, the error also occurs without KV cache reuse? So the issue is about the prompt table with TP>1?
@Funatiq Thank you for looking into this!
I have a test case where the model is built with the --use_paged_context_fmha enable option, the runner is instantiated without the kv_cache_enable_block_reuse=True argument, and runner.generate is called without the input_token_extra_ids option; the same stack trace is still dumped.
Please let me know if you'd like me to try building the model differently to see if it reproduces.
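For concreteness, the test case is roughly the sketch below; the engine path, token ids, and prompt-table shape are placeholders rather than my actual values, and the argument names follow the ModelRunnerCpp API as I understand it:

# Sketch: runner created without kv_cache_enable_block_reuse,
# generate() given a prompt table but no input_token_extra_ids.
# (With TP>1 each MPI rank creates its own runner with its own rank.)
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="engine_dir",  # placeholder path to the built engine
    rank=0,
    # note: no kv_cache_enable_block_reuse=True here
)

outputs = runner.generate(
    batch_input_ids=[torch.tensor([1, 2, 3], dtype=torch.int32)],   # placeholder token ids
    prompt_table=torch.zeros(1, 1024, 4096, dtype=torch.bfloat16),  # placeholder prompt embeddings
    prompt_tasks="0",
    max_new_tokens=32,
    end_id=2,
    pad_id=2,  # placeholder special-token ids
)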
Same problem with a prompt table and TP=4: https://github.com/NVIDIA/TensorRT-LLM/issues/2358 (KV cache reuse is disabled).
I can reproduce the same error reported in https://github.com/NVIDIA/TensorRT-LLM/issues/2417 by running a Python script with mpirun outside of tritonserver.
I am running with TP=2, and the root node is the one that logs this error stack, while rank 1 completes the generation successfully. This is consistent with what I'm seeing in tritonserver.
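Stripped down, the standalone repro looks like the sketch below; the script name and engine path are hypothetical, and the rank handling follows the pattern used in the TRT-LLM examples:

# repro.py (hypothetical name), launched as: mpirun -n 2 python repro.py
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp

rank = tensorrt_llm.mpi_rank()
runner = ModelRunnerCpp.from_dir(engine_dir="engine_dir", rank=rank)  # TP=2 engine, placeholder path
# ...followed by the same generate() call with a prompt table as in the sketch above;
# rank 0 logs the error stack while rank 1 completes generation.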
This is not a Triton problem; I absolutely agree with you, @jxchenus (I don't use Triton).
I just tested inside a new container built with TensorRT-LLM@main (535c9cc), and confirm that this bug is no longer reproducible.
I was also provided with a patch, but the update is inside some closed-source code.
@akhoroshev We root-caused it to an issue in TRT-LLM, and it is resolved in the latest main (it will also be included in the next stable release). Please try it and see if it works for you. Thanks!
System Info
AWS EC2 instance: g6e.48xlarge
TensorRT-LLM v0.13.0
Triton Inference Server v2.50.0
NVIDIA 24.09-py3-min used as the base image for the Docker template

Who can help?
@xuanzic
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Run the container with --gpus '"device=4,5,6,7"' and build the engine:
trtllm-build --checkpoint_dir $CONTAINER_LLM_CKPT_DIR/ckpt_tp$LLM_TP_DEGREE/ \
  --gemm_plugin bfloat16 \
  --gpt_attention_plugin bfloat16 \
  --max_batch_size 1 \
  --output_dir $CONTAINER_LLM_REPO_DIR/tensorrt_llm/1/engine/ \
  --max_beam_width 1 \
  --max_input_len 1280 \
  --max_num_tokens 2048 \
  --max_prompt_embedding_table_size 1024 \
  --context_fmha enable \
  --remove_input_padding enable \
  --bert_attention_plugin bfloat16 \
  --paged_kv_cache enable \
  --use_paged_context_fmha enable \
  --use_fused_mlp enable \
  --max_seq_len 2048 \
  --max_multimodal_len 4096
Send an inference request via grpcclient.InferenceServerClient:
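The exact request code is not reproduced here; roughly, it follows the pattern below, where the model name and tensor names are placeholders rather than my exact ones:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Placeholder input tensors; the real request also carries the prompt/multimodal inputs.
text_input = grpcclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(np.array([["describe the image"]], dtype=object))

max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

result = client.infer(model_name="ensemble", inputs=[text_input, max_tokens])
print(result.as_numpy("text_output"))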
Expected behavior
Inference should succeed.
Actual behavior
Inference hangs on the client. The following stack trace is logged on tritonserver:
Additional notes
With the same model, configuration, and model.py code, if I just change TP to 1 when converting and building the model, inference succeeds.