NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to benchmark offline throughput? #73

Open zhaoyang-star opened 10 months ago

zhaoyang-star commented 10 months ago
  1. The scripts in benchmarks are only for latency benchmarks. So that we can compare with other LLM inference frameworks, is there any demo for offline throughput? I have compiled the LLM engine with the option --use_inflight_batching.
  2. Is the build option --use_inflight_batching compatible with --use_gpt_attention_plugin bfloat16? The README says: "Note that in-flight batching in C++ runtime works only with attention plugin --use_gpt_attention_plugin=float16, paged KV cache --paged_kv_cache and with packed data --remove_input_padding."
juney-nvidia commented 10 months ago

> 1. The scripts in benchmarks are only for latency benchmarks. So that we can compare with other LLM inference frameworks, is there any demo for offline throughput? I have compiled the LLM engine with the option --use_inflight_batching.

You can refer here to measure the perf with inflight batching.

> 2. Is the build option --use_inflight_batching compatible with --use_gpt_attention_plugin bfloat16? The README says: "Note that in-flight batching in C++ runtime works only with attention plugin --use_gpt_attention_plugin=float16, paged KV cache --paged_kv_cache and with packed data --remove_input_padding."

Yes, --use_inflight_batching is compatible with --use_gpt_attention_plugin bfloat16. In-flight batching requires some special logic that is currently only supported by the GPT attention plugin, which is why there is such a requirement.

77h2l commented 10 months ago

@juney-nvidia Regarding https://aistudio.baidu.com/projectdetail/5017442: is a cmake step missing? The original repo has no rule for running make under the build dir.

zhaoyang-star commented 10 months ago

@juney-nvidia Thanks for your kind help. The good news: I have obtained the fixed BatchSize/InputLen/OutputLen benchmark by running cpp/build/benchmarks/gptSessionBenchmark.

llama-7b on A100-40GB, fixed BatchSize/InputLen/OutputLen:

| batch_size | input_length | output_length | latency (ms) |
|-----------:|-------------:|--------------:|-------------:|
|          1 |          256 |           256 |      3105.15 |
|          1 |          512 |           512 |      6403.17 |
|          8 |          256 |           256 |      3607.19 |
|          8 |          512 |           512 |      7847.26 |
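
For a rough offline-throughput number, the fixed-shape latencies above can be converted to output tokens per second; a minimal sketch, assuming each latency covers generating batch_size * output_length tokens end to end (so this is a ballpark, not a substitute for the batch-manager benchmark):

# Ballpark conversion of the fixed-shape latencies above into output tokens/s.
# Assumes each latency is the end-to-end time to generate batch_size * output_length
# tokens; it ignores the prefill/decode split, so treat the numbers as rough only.
runs = [
    # (batch_size, input_length, output_length, latency_ms)
    (1, 256, 256, 3105.15),
    (1, 512, 512, 6403.17),
    (8, 256, 256, 3607.19),
    (8, 512, 512, 7847.26),
]

for batch_size, input_length, output_length, latency_ms in runs:
    tokens_per_s = batch_size * output_length / (latency_ms / 1000.0)
    print(f"bs={batch_size} in={input_length} out={output_length}: "
          f"~{tokens_per_s:.0f} output tokens/s")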

There are still several questions:

  1. I tried to launch the Batch Manager benchmark. What dataset is needed for prepare_dataset.py? Could you please give a demo?
  2. What is the difference between gptSessionBenchmark (C++) and benchmark.py (Python)? benchmark.py can also give the fixed BatchSize/InputLen/OutputLen latency.
  3. How can users do an online throughput benchmark? The linked blog post is very helpful.

Thanks in advance.

jdemouth-nvidia commented 10 months ago

1/ You can try the CNN Daily dataset.

2/ We currently focus our efforts on the C++ code, so it's the most up to date. We'd love to provide users with both solutions, but we cannot afford to build the two stacks in parallel. So we recommend using the C++ stack.

3/ Happy you found the blog post useful. I like it too. For online throughput, we will probably publish a tool in the future, but it's not ready yet.
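
In the meantime, a generic online-throughput harness is easy to sketch in plain Python; everything below (the endpoint URL, payload shape, and the num_output_tokens field) is a hypothetical stand-in for whatever HTTP server actually fronts the engine, not a TensorRT-LLM API:

import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint and request schema; adapt to the server you deploy.
ENDPOINT = "http://localhost:8000/generate"
PROMPTS = ["Summarize the following article: ..."] * 64  # toy workload
CONCURRENCY = 8

def send_request(prompt: str) -> int:
    """Send one generation request and return the reported number of output tokens."""
    payload = json.dumps({"prompt": prompt, "max_new_tokens": 256}).encode()
    req = urllib.request.Request(ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    # Assumes the server reports how many tokens it generated for this request.
    return body.get("num_output_tokens", 0)

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    token_counts = list(pool.map(send_request, PROMPTS))
elapsed = time.time() - start

print(f"requests/s: {len(PROMPTS) / elapsed:.2f}")
print(f"output tokens/s: {sum(token_counts) / elapsed:.2f}")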

zhaoyang-star commented 10 months ago

> 1/ You can try the CNN Daily dataset.

@jdemouth-nvidia The keys in the CNN Daily dataset are id, article, and highlights, while prepare_dataset.py expects the keys input, instruction, and output. So the CNN Daily dataset may not be usable directly.
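
A minimal remapping sketch that would bridge the two schemas; the file paths, the instruction string, and the article -> input / highlights -> output mapping are all assumptions about what prepare_dataset.py wants, not something verified against the script:

import json

# Hypothetical paths for a CNN/DailyMail dump and the remapped output file.
with open("cnn_dailymail.json") as f:
    cnn_records = json.load(f)  # each record: {"id": ..., "article": ..., "highlights": ...}

converted = [
    {
        "input": rec["article"],       # article body as the prompt input (assumed mapping)
        "instruction": "Summarize the following news article.",  # assumed instruction text
        "output": rec["highlights"],   # reference summary as the target output (assumed mapping)
    }
    for rec in cnn_records
]

with open("cnn_dailymail_for_prepare_dataset.json", "w") as f:
    json.dump(converted, f)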

> 2/ We currently focus our efforts on the C++ code, so it's the most up to date. We'd love to provide users with both solutions, but we cannot afford to build the two stacks in parallel. So we recommend using the C++ stack.

Got it!

> 3/ Happy you found the blog post useful. I like it too. For online throughput, we will probably publish a tool in the future, but it's not ready yet.

Will stay tuned.

ishaan-jaff commented 9 months ago

If you're looking to maximize LLM throughput, LiteLLM now has a router to load balance requests (I'd love feedback if people on this thread are trying to do this).

Here's how to use it. Docs: https://docs.litellm.ai/docs/routing

import os  # needed for the os.getenv calls below

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}])

print(response)