NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

free_gpu_memory_fraction not working for examples/apps/openai_server.py #2350

Open · anaivebird opened this issue 3 days ago

anaivebird commented 3 days ago

Who can help?

@byshiue

Reproduction

```shell
cd examples/apps
python3 ./openai_server.py /tmp/engine/qwen-trt-engine-fusion-1gpu --tokenizer /tmp/qwen-7b
```

Expected behavior

Free GPU memory should increase when free_gpu_memory_fraction is set to a lower value in examples/apps/openai_server.py.

Actual behavior

Free GPU memory does not change, no matter what value free_gpu_memory_fraction is set to.

Additional notes

None.

syuoni commented 3 days ago

Hi @anaivebird,

You are right. We should pass the kv_cache_config to the LLM constructor here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/apps/openai_server.py#L468-L474. Let me fix it.