NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Inference time for Mixtral-8x7B model is slowing down with every new request #1097

Closed: punkerpunker closed this issue 7 months ago

punkerpunker commented 9 months ago

System Info

GPUs: 2xA100 PCI-e

Who can help?

@kaiyux


Reproduction

Using the sources from the branch corresponding to the v0.7.1 tag

Building the model:

python ../llama/build.py --model_dir ./Mixtral-8x7B-v0.1 \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin \
                --world_size 2 \
                --pp_size 2 \
                --output_dir ./trt_engines/mixtral/PP

Following the steps from here to package it into Triton Inference Server.

Sending requests like the following:

import requests

headers = {
    "Content-Type": "application/json",
}

data = {
    "text_input": "Generate a random text up the max number of new tokens",
    "max_tokens": 300,
    "bad_words": "",
    "stop_words": ""
}

# Endpoint URL elided; the request goes to the BLS model's generate endpoint.
response = requests.post('.../v2/models/tensorrt_llm_bls/generate', headers=headers, json=data)

Expected behavior

TensorRT-LLM should give better performance than TGI (~2.5 RPS for the quantized model with 300 output tokens).

actual behavior

glebvazhenin@RYG7YPT4W7 ~ % k6 run script.js   

     execution: local
        script: script.js
        output: -

     scenarios: (100.00%) 1 scenario, 250 max VUs, 8m30s max duration (incl. graceful stop):
              * contacts: Up to 6.00 iterations/s for 8m0s over 3 stages (maxVUs: 250, gracefulStop: 30s)

INFO[0014] String input: Generate a random text up the max number of new tokens
INFO[0014] Status: 200                                   source=console
INFO[0014] Response time: 5970.938 ms                    source=console
INFO[0014] Generated tokens: ...
INFO[0020] Status: 200                                   source=console
INFO[0020] Response time: 8742.619 ms                    source=console
INFO[0020] Generated tokens: ...
INFO[0026] Status: 200                                   source=console
INFO[0026] Response time: 12220.316 ms                   source=console
INFO[0026] Generated tokens: ...
INFO[0032] Status: 200                                   source=console
INFO[0032] Response time: 16089.603 ms                   source=console
INFO[0032] Generated tokens: ...
INFO[0037] Status: 200                                   source=console
INFO[0037] Response time: 20116.343 ms                   source=console
INFO[0037] Generated tokens: ...
INFO[0043] Status: 200                                   source=console
INFO[0043] Response time: 24414.768 ms                   source=console
INFO[0043] Generated tokens: ...
INFO[0049] Status: 200                                   source=console
INFO[0049] Response time: 28801.644 ms                   source=console
INFO[0049] Generated tokens: ...
INFO[0055] Status: 200                                   source=console
INFO[0055] Response time: 33385.066 ms                   source=console
INFO[0055] Generated tokens: ...

The load test ramps users up to 5 RPS over the first three minutes, so the output above corresponds to roughly 0.15 RPS.

As you can see, the response time increases quickly. Moreover, it does not fall back after the requests are processed; it feels like they stay stuck inside the model until the Triton container is restarted. Also, the GPU voltage stays high after that load (even some time after the load is released).

additional notes

Using the pre-built Triton server image nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3. Not sure whether this is a Triton problem or a TensorRT-LLM one, though. Any pointers on where to look would be much appreciated. Thanks!

1350lumen commented 8 months ago

It looks like the default examples are not configured for high throughput. The increasing response time is the effect of requests queuing up in front of either the tensorrt_llm_bls or the tensorrt_llm model.

tensorrt_llm_bls is a Python backend, and it generally needs multiple instances to handle concurrent requests. With only one instance, it handles one long-running request at a time. To fix this, the options are: increase the number of Python instances for the BLS model, use the ensemble model, or create a custom C++ backend. The ensemble is best for a quick start (when no token streaming is needed); increasing the BLS instance count is sketched below.
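A minimal sketch of that change, assuming the stock tensorrt_llm_bls model from the tensorrtllm_backend repository (the instance count of 4 and KIND_CPU are illustrative choices, not values taken from this thread):

# tensorrt_llm_bls/config.pbtxt (excerpt, illustrative values)
# Several CPU instances of the Python BLS model let concurrent requests be
# forwarded to the tensorrt_llm model instead of being handled one at a time.
instance_group [
  {
    count: 4
    kind: KIND_CPU
  }
]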

tensorrt_llm - for best throughput, the TRT engine must be built with in-flight batching enabled (it is by default) and with a reasonable max_batch_size. If max_batch_size is small (it can be checked in the tritonserver logs with --log-verbose=1), batching will have no effect at all. Also, config.pbtxt must have the mandatory "gpt_model_type" option set to a type that supports in-flight batching (e.g. inflight_fused_batching); see the sketch below.
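For reference, a hedged sketch of the relevant excerpt of the tensorrt_llm model's config.pbtxt, assuming the standard tensorrtllm_backend layout; the max_batch_size of 64 is illustrative and must not exceed the value the engine was built with:

# tensorrt_llm/config.pbtxt (excerpt, illustrative values)
# max_batch_size must not exceed the engine's build-time --max_batch_size.
max_batch_size: 64

# gpt_model_type must name a batching scheme that supports in-flight batching.
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}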

punkerpunker commented 7 months ago

Managed to build the engine with the proper throughput using the following parameters (on 1 A100 GPU):

trtllm-build --checkpoint_dir /tensorrt_engines/tllm_checkpoint_mixtral_1gpu_int8 \
             --output_dir /tensorrt_engines_converted/tllm_checkpoint_mixtral_1gpu_int8_bs64_optim \
             --gemm_plugin float16 \
             --max_batch_size 64 \
             --remove_input_padding enable \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --max_input_len 2048 \
             --max_output_len 512 \
             --max_num_tokens 16384 \
             --use_fused_mlp