Status: Closed (xesdiny closed this 8 months ago)
Well, I found that I could use AsyncLLMEngine for that.

Yes, AsyncLLMEngine is the best option. There will also soon be a standard solution to integrate TRT-LLM with the Python backend of Triton. Leaving for @schetlur-nv to comment.
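For context, here is a minimal sketch of what serving through an async engine can look like, assuming the comment refers to vLLM's `AsyncLLMEngine` (that assumption, as well as the model path, parallelism degree, and sampling settings below, are illustrative and not from the original thread):

```python
import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Build the async engine once; tensor_parallel_size shards the model
# across GPUs, mirroring the TP=4 deployment discussed in this issue.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="/path/to/model", tensor_parallel_size=4)
)


async def generate(prompt: str) -> str:
    params = SamplingParams(temperature=0.8, max_tokens=128)
    request_id = str(uuid.uuid4())
    final = None
    # generate() is an async generator that yields incremental
    # RequestOutputs; the last one holds the completed text.
    async for output in engine.generate(prompt, params, request_id):
        final = output
    return final.outputs[0].text


print(asyncio.run(generate("Hello, my name is")))
```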
System Info
- CPU architecture: x86_64
- GPU name: NVIDIA V100
- GPU memory size: 32 GB x 8
- TensorRT-LLM branch: v0.7.1
- TensorRT-LLM commit: 80bc075
Who can help?
@ncomly-nvidia
Information
My goal is to use `pybind` via the Triton Python backend to deploy multi-GPU inference. With TP=1, the request path completes end to end successfully. But when I set up a multi-GPU deployment with TP=4, the first request returns successfully, while the second request throws a GEMM error in the `executeContextStep` stage.

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
The relevant files are `tensorrt_llm/1/model.py` and `config.pbtxt`.
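The files themselves are not reproduced above. As a point of reference only, a minimal skeleton of a Triton Python backend `model.py` that drives the TensorRT-LLM Python runtime might look like the following; the engine directory, tensor names, and generation parameters are hypothetical, not taken from the author's files:

```python
import numpy as np
import torch
import triton_python_backend_utils as pb_utils

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner


class TritonPythonModel:
    def initialize(self, args):
        # Hypothetical engine path; with a TP-sharded engine, each MPI rank
        # loads its own shard, so the runner is built with the local rank.
        self.runner = ModelRunner.from_dir(
            engine_dir="/engines/llama-tp4",
            rank=tensorrt_llm.mpi_rank(),
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            ids = pb_utils.get_input_tensor_by_name(request, "input_ids")
            batch_input_ids = [
                torch.tensor(row, dtype=torch.int32) for row in ids.as_numpy()
            ]
            # generate() runs the context and generation phases on the engine.
            output_ids = self.runner.generate(batch_input_ids, max_new_tokens=64)
            out = pb_utils.Tensor(
                "output_ids", output_ids.cpu().numpy().astype(np.int32)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        self.runner = None
```

One design note: with TP > 1, every rank must execute the same `generate` call in lockstep. A common pitfall in Python-backend deployments is that only rank 0 receives the Triton request, leaving the other ranks desynchronized after the first call, which can surface as runtime errors on subsequent requests.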
Expected behavior
actual behavior
When making the second request call:
additional notes
The client I use is based on `tools/gpt/client.py` from `tensorrtllm_backend` at commit 3a61c37.
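For completeness, here is a minimal client in the same spirit as `tools/gpt/client.py`; the server URL, model name, tensor names, and token IDs below are placeholders matching the sketch above, not values from the referenced script:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to be listening on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical pre-tokenized prompt; a real client tokenizes text first.
input_ids = np.array([[1, 15043, 29892, 3186]], dtype=np.int32)

infer_input = httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
infer_input.set_data_from_numpy(input_ids)

result = client.infer(model_name="tensorrt_llm", inputs=[infer_input])
print(result.as_numpy("output_ids"))
```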