LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0
4.35k stars 312 forks

Support loading multiple models #846

Open · Martmists-GH opened 1 month ago

Martmists-GH commented 1 month ago

At the moment, calling load_model a second time causes it to overwrite the previous model. Instead, load_model should return a pointer to some handle struct, and this handle could then be passed around to the other functions in order to invoke actions on it.

If the goal is to maintain backwards compatibility, the existing functions could reference a static handle address while the core logic supports any number of active handles.
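The handle-based design proposed above could be sketched as follows. This is a hypothetical illustration, not koboldcpp's actual code: the registry, the `load_model`/`generate` signatures, and the "newest load becomes the default" rule are all assumptions made for the sketch. The idea is that the core logic keys everything off an opaque handle, while the existing single-model functions delegate to a static default handle for backwards compatibility.

```python
import itertools

# Hypothetical handle registry (assumption: real code would store model
# weights and per-device allocations here, not just the path).
_handles = {}
_next_id = itertools.count(1)
_default = None  # static handle used by the legacy single-model API


def load_model(path):
    """Load a model and return an opaque handle id instead of
    overwriting global state."""
    global _default
    hid = next(_next_id)
    _handles[hid] = {"path": path}
    _default = hid  # legacy callers keep pointing at the newest load
    return hid


def generate(prompt, handle=None):
    """Act on an explicit handle, or fall back to the legacy default."""
    model = _handles[handle if handle is not None else _default]
    return "[{}] {}".format(model["path"], prompt)
```

With this shape, old callers that never pass a handle behave exactly as before (the last `load_model` wins), while new callers can keep several models live and address each one explicitly.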

LostRuins commented 1 month ago

Actually, the Python and C++ methods are not intended to be used directly by other programs, as they may change without warning; the API is the preferred way to do it.

But regarding that point: the problem is that I don't have a way to effectively free the resources taken by a model, which may be partially offloaded across different devices (CPU, GPU, etc.). Even attempting to unload the DLL does not fully release the allocated resources, and the existing deallocation code in GGML leaks memory. The only surefire way to do it right now would be to launch the backend as a separate subprocess, which brings with it the issue of inter-process communication.
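The subprocess approach mentioned here can be sketched in Python. Everything in this example is an assumption for illustration (the `ModelProcess` class, the pipe protocol, the stub worker): it only shows the shape of the idea, namely that running each model in its own process lets the OS reclaim all of its memory, including driver-side allocations the in-process deallocation code can't release, simply by ending the process.

```python
import multiprocessing as mp


def _worker(path, conn):
    # Hypothetical worker: in the real case this process would load the
    # GGML/GGUF model (with any CPU/GPU offload) and serve requests.
    model = {"path": path}
    for prompt in iter(conn.recv, None):  # None is the shutdown sentinel
        conn.send("[{}] {}".format(model["path"], prompt))
    conn.close()


class ModelProcess:
    """One model per subprocess: exiting the process frees everything,
    sidestepping leaky in-process deallocation."""

    def __init__(self, path):
        self._parent, child = mp.Pipe()
        self._proc = mp.Process(target=_worker, args=(path, child))
        self._proc.start()

    def generate(self, prompt):
        # The IPC cost the comment alludes to: every request and reply
        # crosses a pipe between processes.
        self._parent.send(prompt)
        return self._parent.recv()

    def unload(self):
        self._parent.send(None)          # ask the worker to exit cleanly
        self._proc.join(timeout=5)
        if self._proc.is_alive():
            self._proc.terminate()       # a hard kill still frees resources
```

The trade-off is exactly the one named in the comment: unloading becomes trivial and reliable, but every generation request now pays for serialization and a round trip over the pipe.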