abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

model.close() Fails to Release Memory from ChatHandler Projector in Multimodal Models #1746

Closed cesarandreslopez closed 1 week ago

cesarandreslopez commented 1 week ago

Expected Behavior

When calling model.close(), the VRAM used by both the model and the associated projector model in a ChatHandler (for multimodal models) should be fully released.

Current Behavior

When using a multimodal model with a ChatHandler (e.g., moondream2), the model.close() method correctly releases the VRAM used by the main model but fails to release the VRAM used by the projector model within the ChatHandler. This results in residual memory usage and eventual exhaustion of VRAM, especially after multiple model loads and closures.

Steps to Reproduce

  1. Load the model with a ChatHandler for a multimodal model (Moondream or minicpm-v):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler(
    clip_model_path="/llm_models/minicpm-v/minicpmv-8b-projector_f16.gguf",
)
model = Llama(
    model_path="/llm_models/moondream2/moondream:1.8b-model-4.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    chat_handler=chat_handler,
)
  2. After performing inference, call model.close():
model.close()
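
For reference, here is a minimal sketch of a loop that makes the leak easy to observe (it reuses the same placeholder paths as the steps above); VRAM usage, e.g. as reported by nvidia-smi, grows on every iteration because only the main model's memory is returned by close():

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

for _ in range(5):
    chat_handler = MoondreamChatHandler(
        clip_model_path="/llm_models/minicpm-v/minicpmv-8b-projector_f16.gguf",
    )
    model = Llama(
        model_path="/llm_models/moondream2/moondream:1.8b-model-4.gguf",
        n_gpu_layers=-1,
        n_ctx=2048,
        chat_handler=chat_handler,
    )
    model.close()  # frees the main model, but the projector stays in VRAM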

Issue: after model.close() returns, the VRAM allocated for the projector loaded by the ChatHandler is still in use; only the main model's memory is freed.

Suggested Fix or Enhancement

The model.close() function should ensure that all resources, including those used by the ChatHandler's projector, are properly deallocated.
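
One possible shape for such a fix, sketched here as a user-side subclass rather than the library's actual implementation (the chat_handler attribute on Llama and the handler's private _exit_stack are assumptions based on the workaround below):

from llama_cpp import Llama

class LlamaWithProjectorCleanup(Llama):
    def close(self) -> None:
        # Close the main model first, then drain the chat handler's exit
        # stack so the projector's VRAM is released as well.
        super().close()
        handler = getattr(self, "chat_handler", None)
        exit_stack = getattr(handler, "_exit_stack", None)
        if exit_stack is not None:
            exit_stack.close()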

cesarandreslopez commented 1 week ago

Workaround:

Explicitly close the ChatHandler as well, like this:

chat_handler._exit_stack.close()

It would probably be a good idea for model.close() to close the handler as well, but I'm leaving this workaround here in case someone else runs into the same issue.
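
For completeness, the full teardown with the workaround applied looks like this (note that _exit_stack is a private attribute, so this may break in a future release):

model.close()                     # releases the main model's VRAM
chat_handler._exit_stack.close()  # releases the projector loaded by the ChatHandler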