NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

free_gpu_memory_fraction not working for examples/apps/openai_server.py #2350

Open · anaivebird opened this issue 3 days ago

anaivebird commented 3 days ago

Who can help?

@byshiue

Reproduction

```shell
cd examples/apps
python3 ./openai_server.py /tmp/engine/qwen-trt-engine-fusion-1gpu --tokenizer /tmp/qwen-7b
```

Expected behavior

Free GPU memory should increase when free_gpu_memory_fraction is set to a lower value in examples/apps/openai_server.py.

Actual behavior

Free GPU memory does not change, no matter what value free_gpu_memory_fraction is set to.

Additional notes

None.

syuoni commented 3 days ago

Hi @anaivebird,

You are right. We should pass the kv_cache_config to the LLM constructor here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/apps/openai_server.py#L468-L474. Let me fix it.