abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

model.close() Fails to Release Memory from ChatHandler Projector in Multimodal Models #1746

Closed cesarandreslopez closed 1 week ago

cesarandreslopez commented 1 week ago

Expected Behavior

When calling model.close(), the VRAM used by both the model and the associated projector model in a ChatHandler (for multimodal models) should be fully released.

Current Behavior

When using a multimodal model with a ChatHandler (e.g., moondream2), the model.close() method correctly releases the VRAM used by the main model but fails to release the VRAM used by the projector model within the ChatHandler. This results in residual memory usage and eventual exhaustion of VRAM, especially after multiple model loads and closures.

Steps to Reproduce

  1. Load the model with a ChatHandler for a multimodal model (Moondream or minicpm-v):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler(
    clip_model_path="/llm_models/minicpm-v/minicpmv-8b-projector_f16.gguf",
)
model = Llama(
    model_path="/llm_models/moondream2/moondream:1.8b-model-4.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    chat_handler=chat_handler,
)
  2. After performing inference, call model.close():
model.close()
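
For reference, here is a minimal sketch of a loop that makes the leak easy to observe (it reuses the same placeholder paths as the steps above); VRAM usage, e.g. as reported by nvidia-smi, grows on every iteration because only the main model's memory is returned by close():

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

for _ in range(5):
    chat_handler = MoondreamChatHandler(
        clip_model_path="/llm_models/minicpm-v/minicpmv-8b-projector_f16.gguf",
    )
    model = Llama(
        model_path="/llm_models/moondream2/moondream:1.8b-model-4.gguf",
        n_gpu_layers=-1,
        n_ctx=2048,
        chat_handler=chat_handler,
    )
    model.close()  # frees the main model, but the projector stays in VRAM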

Issue: after model.close() returns, the VRAM allocated for the projector loaded by the ChatHandler is still in use; only the main model's memory is freed.

Suggested Fix or Enhancement

The model.close() function should ensure that all resources, including those used by the ChatHandler's projector, are properly deallocated.
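
One possible shape for such a fix, sketched here as a user-side subclass rather than the library's actual implementation (the chat_handler attribute on Llama and the handler's private _exit_stack are assumptions based on the workaround below):

from llama_cpp import Llama

class LlamaWithProjectorCleanup(Llama):
    def close(self) -> None:
        # Close the main model first, then drain the chat handler's exit
        # stack so the projector's VRAM is released as well.
        super().close()
        handler = getattr(self, "chat_handler", None)
        exit_stack = getattr(handler, "_exit_stack", None)
        if exit_stack is not None:
            exit_stack.close()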

cesarandreslopez commented 1 week ago

Workaround:

Explicitly close the ChatHandler as well, like this:

chat_handler._exit_stack.close()

It would probably be a good idea for model.close() to close the handler as well, but I'm leaving this workaround here in case someone else runs into the same issue.
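
For completeness, the full teardown with the workaround applied looks like this (note that _exit_stack is a private attribute, so this may break in a future release):

model.close()                     # releases the main model's VRAM
chat_handler._exit_stack.close()  # releases the projector loaded by the ChatHandler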