zhaoyang-star opened this issue 1 year ago
> 1. The scripts in `benchmarks` are only for latency benchmarks. So that we can compare with other LLM inference frameworks, is there any demo for offline throughput? I have compiled the LLM engine with the option `--use_inflight_batching`.
You can refer here to measure the perf with inflight batching.
> - Is the build option `--use_inflight_batching` compatible with `--use_gpt_attention_plugin bfloat16`? The README says: "Note that in-flight batching in the C++ runtime works only with the attention plugin `--use_gpt_attention_plugin=float16`, paged KV cache `--paged_kv_cache` and with packed data `--remove_input_padding`."
Yes, `--use_inflight_batching` is compatible with `--use_gpt_attention_plugin bfloat16`. In-flight batching requires some special logic that is currently only supported by the GPT attention plugin, which is why that requirement exists.
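For reference, here is a minimal sketch of an engine build that combines these flags. The four plugin/batching flags are the ones quoted above; the `examples/llama/build.py` entry point and the `--model_dir`/`--dtype`/`--output_dir` arguments are assumptions about the usual LLaMA example, so adjust them to your checkout:

```python
# Hedged sketch: combining the flags discussed above in one engine build.
# The examples/llama/build.py path and the --model_dir / --dtype / --output_dir
# arguments are assumptions; --use_inflight_batching, --use_gpt_attention_plugin,
# --paged_kv_cache and --remove_input_padding are the flags quoted in this thread.
import subprocess

subprocess.run(
    [
        "python", "examples/llama/build.py",
        "--model_dir", "./llama-7b-hf",            # assumed local HF checkpoint
        "--dtype", "bfloat16",
        "--use_gpt_attention_plugin", "bfloat16",
        "--use_inflight_batching",
        "--paged_kv_cache",
        "--remove_input_padding",
        "--output_dir", "./engines/llama-7b-bf16",
    ],
    check=True,
)
```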
@juney-nvidia Regarding https://aistudio.baidu.com/projectdetail/5017442: is a necessary cmake step missing? The original repo has no rule for running `make` under the build dir.
@juney-nvidia Thanks for your kind help.
The good news: I have obtained the fixed BatchSize/InputLen/OutputLen benchmark results by running `cpp/build/benchmarks/gptSessionBenchmark`.
llama-7b on A100-40GB, fixed BatchSize/InputLen/OutputLen:

| batch_size | input_length | output_length | latency (ms) |
| --- | --- | --- | --- |
| 1 | 256 | 256 | 3105.15 |
| 1 | 512 | 512 | 6403.17 |
| 8 | 256 | 256 | 3607.19 |
| 8 | 512 | 512 | 7847.26 |
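As a rough back-of-the-envelope, those fixed-size latencies can be turned into an offline throughput estimate by dividing the generated tokens by the end-to-end latency (this ignores scheduling and in-flight batching effects, so treat it only as a lower bound):

```python
# Rough throughput estimate from the fixed BatchSize/InputLen/OutputLen runs above.
runs = [
    # (batch_size, input_length, output_length, latency_ms) from the table above
    (1, 256, 256, 3105.15),
    (1, 512, 512, 6403.17),
    (8, 256, 256, 3607.19),
    (8, 512, 512, 7847.26),
]

for batch_size, input_len, output_len, latency_ms in runs:
    tokens_per_s = batch_size * output_len / (latency_ms / 1000.0)
    print(f"bs={batch_size} in={input_len} out={output_len}: "
          f"~{tokens_per_s:.1f} output tokens/s")
```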
There are still several questions: what is the difference between `gptSessionBenchmark` (C++) and `benchmark.py` (Python)? `benchmark.py` can also give the fixed BatchSize/InputLen/OutputLen latency. Thanks in advance.
1/ You can try the CNN Daily dataset.
2/ We currently focus our efforts on the C++ code, so it's the most up to date. We'd love to provide users with both solutions, but we cannot afford to build the two stacks in parallel, so we recommend using the C++ stack.
3/ I'm happy you found the blog post useful; I like it too. For online throughput, we will probably publish a tool in the future, but it's not ready yet.
> 1/ You can try the CNN Daily dataset.

@jdemouth-nvidia The keys in CNN Daily are `id`, `article` and `highlights`, while `prepare_dataset.py` expects the keys `input`, `instruction` and `output`, so the CNN Daily dataset may not be usable directly.
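As a hedged sketch of the kind of conversion that would be needed: the CNN/DailyMail field names come from the comment above, while the Hugging Face `datasets` loader, the summarization instruction string, and the output file name are assumptions (the exact schema `prepare_dataset.py` accepts should be double-checked):

```python
# Hedged sketch: remap CNN/DailyMail records (id / article / highlights) into the
# input / instruction / output layout described above for prepare_dataset.py.
# The Hugging Face `datasets` loader and the instruction string are assumptions.
import json
from datasets import load_dataset

cnn = load_dataset("cnn_dailymail", "3.0.0", split="test")

records = []
for sample in cnn:
    records.append({
        "instruction": "Summarize the following article.",  # assumed prompt
        "input": sample["article"],
        "output": sample["highlights"],
    })

with open("cnn_dailymail_converted.json", "w") as f:
    json.dump(records, f)
```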
> 2/ We currently focus our efforts on the C++ code, so it's the most up to date. We'd love to provide users with both solutions, but we cannot afford to build the two stacks in parallel, so we recommend using the C++ stack.

Got it!
> 3/ I'm happy you found the blog post useful; I like it too. For online throughput, we will probably publish a tool in the future, but it's not ready yet.

I'll stay tuned.
If you're looking to maximize LLM throughput, LiteLLM now has a router to load-balance requests (I'd love feedback if people on this thread are trying to do this).
Here's how to use it. Docs: https://docs.litellm.ai/docs/routing
```python
import os

from litellm import Router

# list of model deployments
model_list = [
    {
        "model_name": "gpt-3.5-turbo",  # model alias
        "litellm_params": {  # params for litellm completion/embedding call
            "model": "azure/chatgpt-v-2",  # actual model name
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "azure/chatgpt-functioncalling",
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
    },
]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)
print(response)
```
Hi @zhaoyang-star, do you still have any further issues or questions? If not, we'll close this soon.