Open shenjiangqiu opened 11 months ago
@shenjiangqiu do you enable `--paged_kv_cache`? If yes, please use the C++ runtime. If not, have you modified any code?
Hi, I'm not using `--paged_kv_cache`. I'm using beam search.
Did you test on the main branch?
Hi @shenjiangqiu, would you please try the latest code base to see whether the issue still exists?
Do you still have any further issues or questions? If not, we'll close this soon.
For example, in /examples/llama/run.py, if you call generate(xxx) in a loop, GPU memory usage keeps growing after each run. There appears to be a memory leak somewhere in the generate function.
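As a minimal illustration of the pattern being described, the sketch below records memory after each call to a function and checks whether the readings keep rising. Note this is a hypothetical CPU-side analogue using `tracemalloc` (the `generate` stub and its leaked buffer are invented for the demo, not TensorRT-LLM code); for the actual GPU case one would record `torch.cuda.memory_allocated()` after each `generate` call instead.

```python
import tracemalloc

_leaked = []  # simulates state that generate() retains and never frees


def generate():
    # Stand-in for the real generate(): appends ~1 MiB to a module-level
    # list each call, mimicking memory that is never released between runs.
    _leaked.append(bytearray(1024 * 1024))


def memory_after_each_call(fn, n_iters=3):
    """Call fn repeatedly and record traced memory after each call."""
    tracemalloc.start()
    readings = []
    for _ in range(n_iters):
        fn()
        current, _peak = tracemalloc.get_traced_memory()
        readings.append(current)
    tracemalloc.stop()
    return readings


readings = memory_after_each_call(generate)
# A strictly increasing series across iterations suggests a leak
# inside the function under test.
assert all(b > a for a, b in zip(readings, readings[1:]))
```

If the per-iteration readings plateau after warm-up, the growth is just one-time allocation (caches, workspaces); if they climb on every iteration, something is being retained per call.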