LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0
4.35k stars 312 forks

Support loading multiple models #846

Open · Martmists-GH opened 1 month ago

Martmists-GH commented 1 month ago

At the moment, calling load_model a second time causes it to overwrite the previous model. Instead, load_model should return a pointer to some handle struct, and this handle could then be passed around to the other functions in order to invoke actions on it.

If the goal is to maintain backwards compatibility, the existing functions could reference a static handle address while the core logic supports any number of active handles.
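The handle-based design proposed above could be sketched as follows. This is a hypothetical illustration, not koboldcpp's actual code: the registry, the `load_model`/`generate` signatures, and the "newest load becomes the default" rule are all assumptions made for the sketch. The idea is that the core logic keys everything off an opaque handle, while the existing single-model functions delegate to a static default handle for backwards compatibility.

```python
import itertools

# Hypothetical handle registry (assumption: real code would store model
# weights and per-device allocations here, not just the path).
_handles = {}
_next_id = itertools.count(1)
_default = None  # static handle used by the legacy single-model API


def load_model(path):
    """Load a model and return an opaque handle id instead of
    overwriting global state."""
    global _default
    hid = next(_next_id)
    _handles[hid] = {"path": path}
    _default = hid  # legacy callers keep pointing at the newest load
    return hid


def generate(prompt, handle=None):
    """Act on an explicit handle, or fall back to the legacy default."""
    model = _handles[handle if handle is not None else _default]
    return "[{}] {}".format(model["path"], prompt)
```

With this shape, old callers that never pass a handle behave exactly as before (the last `load_model` wins), while new callers can keep several models live and address each one explicitly.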

LostRuins commented 1 month ago

Actually, the Python and C++ methods are not intended to be used directly by other programs, as they may change without warning; the API is the preferred way to do it.

But regarding that point: the problem is that I don't have a way to effectively free the resources taken by a model, which may be partially offloaded across different devices (CPU, GPU, etc.). Even attempting to unload the DLL does not fully release the allocated resources, and the existing deallocation code in GGML leaks memory. The only surefire way to do it right now would be to launch the backend as a separate subprocess, which brings with it the issue of inter-process communication.
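The subprocess approach mentioned here can be sketched in Python. Everything in this example is an assumption for illustration (the `ModelProcess` class, the pipe protocol, the stub worker): it only shows the shape of the idea, namely that running each model in its own process lets the OS reclaim all of its memory, including driver-side allocations the in-process deallocation code can't release, simply by ending the process.

```python
import multiprocessing as mp


def _worker(path, conn):
    # Hypothetical worker: in the real case this process would load the
    # GGML/GGUF model (with any CPU/GPU offload) and serve requests.
    model = {"path": path}
    for prompt in iter(conn.recv, None):  # None is the shutdown sentinel
        conn.send("[{}] {}".format(model["path"], prompt))
    conn.close()


class ModelProcess:
    """One model per subprocess: exiting the process frees everything,
    sidestepping leaky in-process deallocation."""

    def __init__(self, path):
        self._parent, child = mp.Pipe()
        self._proc = mp.Process(target=_worker, args=(path, child))
        self._proc.start()

    def generate(self, prompt):
        # The IPC cost the comment alludes to: every request and reply
        # crosses a pipe between processes.
        self._parent.send(prompt)
        return self._parent.recv()

    def unload(self):
        self._parent.send(None)          # ask the worker to exit cleanly
        self._proc.join(timeout=5)
        if self._proc.is_alive():
            self._proc.terminate()       # a hard kill still frees resources
```

The trade-off is exactly the one named in the comment: unloading becomes trivial and reliable, but every generation request now pays for serialization and a round trip over the pipe.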