abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

GPU memory not released for llava multimodal #1451

Open · adogwangwang opened this issue 4 months ago

adogwangwang commented 4 months ago

When I start the llava 13B model using the llama-cpp-python server, I notice that GPU memory usage increases a little after each inference, which suggests that GPU memory is not being released after each call. How can this be resolved? I'd appreciate your help!
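(For reference, the growth is straightforward to measure from outside the server. A minimal monitoring sketch, assuming the server is already running on localhost:8000 with a llava chat format and that `requests` and `pynvml` are installed; the URL, port, and image are placeholders, not the reporter's actual setup:)

```python
# Monitoring sketch (placeholders throughout): send the same multimodal
# request to the llama-cpp-python server repeatedly and log GPU memory
# after each call to see whether usage keeps growing.
import pynvml
import requests

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.png"}},  # placeholder image
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    "max_tokens": 64,
}

for i in range(10):
    requests.post("http://localhost:8000/v1/chat/completions",
                  json=payload, timeout=300).raise_for_status()
    used = pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 1024**2
    print(f"call {i}: {used:.0f} MiB used")  # steadily rising => leak
```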

abetlen commented 4 months ago

@adogwangwang could you provide more info on which backend (I'm assuming CUDA, not Metal) and which version you're running?

adogwangwang commented 4 months ago

Hello, I am using llama-cpp-python 0.2.64 to run llava 1.5 13B multimodal. Here is my command: [image attachment]
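(Since the exact command is only in the screenshot, here is what a typical llava 1.5 setup looks like via the Python API, as a point of reference; the paths and parameters below are placeholders, not the reporter's actual settings.)

```python
# Illustrative sketch only -- paths and parameters are placeholders,
# not the reporter's actual command from the screenshot above.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# CLIP projector weights shipped alongside the llava GGUF model.
chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-model-f16.gguf")

llm = Llama(
    model_path="./llava-v1.5-13b.Q4_K_M.gguf",  # placeholder path
    chat_handler=chat_handler,
    n_ctx=2048,       # enlarged so the image embedding fits in context
    n_gpu_layers=-1,  # offload all layers to the GPU (CUDA build)
)
```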

When I use the llava 13B model, GPU memory usage grows after each inference, which suggests that memory is not being released between calls. Additionally, after several inferences the model starts giving erratic responses: when I ask a question without attaching any image, the answer is still related to the previous image. This indicates that the image state is not being cleared, which would explain both the unreleased memory and the confused responses. I would appreciate some clarification on this issue! @abetlen
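(One way to isolate the second symptom, image state leaking into later answers, is to reproduce it in the Python API directly and check whether `Llama.reset()` changes the behavior. A minimal sketch, assuming `llm` was built with a `Llava15ChatHandler` as in the example above; `reset()` only clears the cached token state, so this is a diagnostic, not a confirmed fix:)

```python
# Diagnostic sketch: ask once WITH an image, then WITHOUT one, and check
# whether the previous image still influences the text-only answers.
image_url = "https://example.com/cat.png"  # placeholder image

llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=64,
)

# No image attached -- the answer should not reference the previous one.
text_only = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What did we just talk about?"}],
    max_tokens=64,
)
print(text_only["choices"][0]["message"]["content"])

llm.reset()  # clear the cached token state, then repeat the question
after_reset = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What did we just talk about?"}],
    max_tokens=64,
)
print(after_reset["choices"][0]["message"]["content"])
```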

LankyPoet commented 3 months ago

This happens for me as well with eris prime punch 9b on a 4090 using CUDA.