Closed: TheCodeWrangler closed this issue 3 months ago.
I have tried a very similar process using the v0.9.0 tag (saw the same results as above).
I have also tried the two latest commits on the main branch (though some changes to my convert/compile args were required). On main I would get backend exceptions, so I have settled on the rel branch for this issue since it seems closest to working.
I have recompiled with --max_batch_size 1 and it appears to have resolved my issue, but it reduces my throughput significantly.
I have always been a bit unclear on how max_batch_size interacts with inflight_fused_batching. Any light you could shed on that interaction would be appreciated.
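For reference, these are the two batch-size-related settings I am referring to (the paths and values below are illustrative placeholders, not my exact setup):

```bash
# Engine-side limit, fixed at build time (this is the flag I rebuilt with)
trtllm-build \
    --checkpoint_dir ./llama3_8b_tp2_ckpt \
    --output_dir ./llama3_8b_tp2_engine \
    --max_batch_size 1

# Scheduler-side settings in the backend model config, filled from the
# template that ships with tensorrtllm_backend
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:1,batching_strategy:inflight_fused_batching
```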
Could you try disabling use_custom_all_reduce and using trtllm 0.10, or pip install tensorrt_llm==0.11.0.dev2024061800?
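Concretely, something along these lines (paths are placeholders; the flag spelling follows the 0.9/0.10 trtllm-build CLI, and the extra index is the documented TensorRT-LLM wheel index):

```bash
# Option A: rebuild the engines with the custom all-reduce kernel disabled
trtllm-build \
    --checkpoint_dir ./llama3_8b_tp2_ckpt \
    --output_dir ./llama3_8b_tp2_engine \
    --use_custom_all_reduce disable

# Option B: move to a newer release / dev wheel
pip install tensorrt_llm==0.11.0.dev2024061800 --extra-index-url https://pypi.nvidia.com
```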
Disabling use_custom_all_reduce fixed the issue! I have not tried with newer images.
System Info
- Debian 11
- nvidia-smi and nvcc --version output not reproduced here
Who can help?
@kaiyux
Reproduction
I am seeking to use a set of LoRA weights (trained with linear 1.75 RoPE scaling and a rotary base of 875,000) on a Llama3-8B base model. I plan to deploy to 2x L4 GPUs and would like to support a 14,000-token context.
I compiled the rel branch of the Triton Inference Server TensorRT-LLM backend (which also uses the rel branch of TensorRT-LLM). I took this path to ensure that the container I serve with is identical to the one I use for compilation.

I am updating the config.json file within the Llama3-8B base model with the RoPE scaling parameters used when training the LoRA adapters, and I am then using this container to compile the Llama3-8B base model for tensor parallelism 2 with the corresponding convert/build commands (both are sketched below).
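In outline, the two steps look like the following (the RoPE values are the ones described above; the paths, dtype, and length limits here are placeholders rather than my exact arguments):

```bash
# 1. Patch the base model's config.json with the RoPE settings the LoRA
#    adapters were trained with (key names follow the Hugging Face Llama config)
python3 - <<'EOF'
import json
path = "Meta-Llama-3-8B/config.json"                      # placeholder path
cfg = json.load(open(path))
cfg["rope_scaling"] = {"type": "linear", "factor": 1.75}  # linear 1.75 scaling
cfg["rope_theta"] = 875000.0                              # rotary base from training
json.dump(cfg, open(path, "w"), indent=2)
EOF

# 2. Convert the HF checkpoint for tensor parallelism 2
python3 examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3-8B \
    --output_dir ./llama3_8b_tp2_ckpt \
    --dtype float16 \
    --tp_size 2

# 3. Build the TP-2 engines with the LoRA plugin enabled
trtllm-build \
    --checkpoint_dir ./llama3_8b_tp2_ckpt \
    --output_dir ./llama3_8b_tp2_engine \
    --gemm_plugin float16 \
    --lora_plugin float16 \
    --lora_dir ./my_lora_adapter \
    --max_batch_size 8 \
    --max_input_len 14000 \
    --max_output_len 512
```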
I have additional conversions to turn my LoRA base weights into warmup files, which I am using to initialize my LoRA weights. I am leaving those details out here (though I might make a PR to provide them in the backend repo).
I then start my inference server and warmup runs successfully.
When I send sequential single-request traffic, all adapters produce high-quality results. When I run several concurrent requests (and begin utilizing in-flight batching), the results degrade: the same input produces different results when it runs alone than when other inferences are in flight.
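As a rough illustration of the traffic pattern (this uses Triton's generate endpoint with the stock ensemble input names from the tensorrtllm_backend examples, not my exact client; LoRA-specific inputs are omitted for brevity):

```bash
PROMPT="identical prompt for every request"
# one request alone is deterministic; several in flight at once are not
for i in $(seq 1 8); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d "{\"text_input\": \"${PROMPT}\", \"max_tokens\": 128, \"beam_width\": 3, \"random_seed\": 42}" &
done
wait
```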
Expected behavior
Inference results are deterministic (beam width 3, and I am passing a random seed as well) and do not change when in-flight batching is active.
Actual behavior
Results are only deterministic when a request is the only inference in flight.
Additional notes
I am willing to repost in the https://github.com/triton-inference-server/tensorrtllm_backend repo if the root cause is in that code.