NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Memory leak in the example #491

Open shenjiangqiu opened 11 months ago

shenjiangqiu commented 11 months ago

For example, in /examples/llama/run.py, if you call generate(xxx) in a loop, GPU memory usage grows after each run. There appears to be a memory leak in the generate function.

wm2012011492 commented 11 months ago

@shenjiangqiu do you enable --paged_kv_cache? If yes, please use CPP runtime. If not, do you modify any code?

shenjiangqiu commented 11 months ago

> @shenjiangqiu do you enable --paged_kv_cache? If yes, please use CPP runtime. If not, do you modify any code?

Hi, I'm not using paged_kv_cache. I use beam search.

byshiue commented 11 months ago

Did you test on the main branch?

nv-guomingz commented 16 hours ago

Hi @shenjiangqiu, would you please try the latest code base to see if the issue still exists?

Do you still have any further issues or questions? If not, we'll close this soon.