Closed: punkerpunker closed this issue 7 months ago
It looks like the default examples are not configured for high throughput. The increasing response time is the effect of requests queuing up in front of either the tensorrt_llm_bls or the tensorrt_llm model.
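One way to see which model the requests are waiting on is Triton's Prometheus metrics endpoint, which reports per-model queue and compute times (this assumes the default metrics port 8002):

```
# Per-model queue/compute durations; a growing queue duration for a model means requests are waiting on it
curl -s localhost:8002/metrics | grep -E 'nv_inference_(queue|compute_infer)_duration_us'
```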
tensorrt_llm_bls is a Python-backend model and generally needs multiple instances to handle concurrent requests. With only one instance, it processes a single long-running request at a time. To fix this, the options are: increase the number of Python instances for the BLS model, use the ensemble model, or write a custom C++ backend. The ensemble should be best for a quick start (when no token streaming is needed).
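For example, a minimal sketch of raising the BLS instance count in tensorrt_llm_bls/config.pbtxt; the count of 8 is only an illustration and should be tuned to the expected concurrency:

```
# tensorrt_llm_bls/config.pbtxt (sketch; 8 is an illustrative value)
instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]
```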
tensorrt_llm - for best throughput, the TRT engine must be built with in-flight batching enabled (it is by default) and with a reasonable max_batch_size. If max_batch_size is small (it can be checked in the tritonserver logs with --log-verbose=1), batching will have no effect at all. Also, config.pbtxt must have the mandatory "gpt_model_type" option set to a type that supports in-flight batching (e.g. inflight_fused_batching).
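For reference, in the default inflight_batcher_llm templates that parameter looks roughly like this in tensorrt_llm/config.pbtxt (only the relevant part shown):

```
# tensorrt_llm/config.pbtxt (only the relevant parameter shown)
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```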
Managed to build the engine with proper throughput using the following parameters (1x A100 GPU):
```
trtllm-build \
  --checkpoint_dir /tensorrt_engines/tllm_checkpoint_mixtral_1gpu_int8 \
  --output_dir /tensorrt_engines_converted/tllm_checkpoint_mixtral_1gpu_int8_bs64_optim \
  --gemm_plugin float16 \
  --max_batch_size 64 \
  --remove_input_padding enable \
  --context_fmha enable \
  --gpt_attention_plugin float16 \
  --max_input_len 2048 \
  --max_output_len 512 \
  --max_num_tokens 16384 \
  --use_fused_mlp
```
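As a quick sanity check, assuming the built engine directory's config.json records the build-time max_batch_size (it does for trtllm-build output), the effective batch size can be confirmed without starting the server:

```
# Check the batch size recorded in the built engine's config (path is the --output_dir above)
grep -o '"max_batch_size": *[0-9]*' /tensorrt_engines_converted/tllm_checkpoint_mixtral_1gpu_int8_bs64_optim/config.json
```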
System Info
GPUs: 2xA100 PCI-e
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Using the sources from the branch corresponding to the v0.7.1 tag.
Building the model:
Following the steps from here to pack it into Triton Inference Server.
Sending requests like the following:
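(The exact payload from the original report is not preserved here; the sketch below is only an illustrative request of this general shape against Triton's HTTP generate endpoint, assuming the default ensemble model name and port from the tensorrtllm_backend examples.)

```
# Illustrative only; model name, port, prompt, and token count are assumptions, not the original payload
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 300, "bad_words": "", "stop_words": ""}'
```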
Expected behavior
TensorRT-LLM gives better performance than TGI (~2.5 RPS for the quantized model with 300 output tokens).
actual behavior
The load test ramps users up to 5 RPS over the first three minutes, so the load shown above is only ~0.15 RPS.
As you may see, the response time increases quickly. Moreover, it does not fall back after the requests are processed; it feels like they are stuck inside the model until the container with Triton is restarted. Also, the GPU voltage stays high after that load (even some time after the load is released).
additional notes
Using the pre-built Triton server image:
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
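(Roughly how such a pre-built image is typically launched; the model-repository path is a placeholder, not the exact command used here.)

```
# Illustrative launch of the pre-built image; /path/to/triton_model_repo is a placeholder
docker run --rm -it --gpus all --network host \
  -v /path/to/triton_model_repo:/triton_model_repo \
  nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 \
  tritonserver --model-repository=/triton_model_repo --log-verbose=1
```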
Not sure whether this is a Triton problem or a TensorRT-LLM one, though. Any pointers on where to look would be much appreciated. Thanks!