KoboldAI / KoboldAI-Client

For GGUF support, see KoboldCPP: https://github.com/LostRuins/koboldcpp
https://koboldai.com
GNU Affero General Public License v3.0

Hot Swapping Models between RAM and VRAM? #321

Closed Sayayaya closed 1 year ago

Sayayaya commented 1 year ago

Models like Pygmalion require a lot of VRAM, which is a much more expensive resource than RAM. My workstation has 64 GB of RAM and 24 GB of VRAM (RTX 4090).

I don't like unloading and reloading the model too often, to avoid wear on my SSDs, but the models also use a lot of VRAM, which can get in the way of other tasks I might be doing on my computer.

What I'd like is the ability to hot-swap models between RAM and VRAM on demand, so that when I'm not using Kobold for a while, I could move the model into RAM, freeing up my VRAM for other tasks. This could be exposed via an API endpoint or something similar.

I imagine loading the model back out of RAM, which I have an abundance of anyway, would be faster than loading it off a drive.

I'm unsure how practical this suggestion is, or how long moving a model from RAM into VRAM (or vice versa) would take. But if the savings over pulling it off disk are significant, I think it would be a worthwhile feature.
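For illustration, here is a minimal PyTorch sketch of what such an on-demand swap could look like. The model name and helper functions are hypothetical examples, not part of KoboldAI's actual API:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical sketch: load a model onto the GPU, park it in system RAM
# on demand, and bring it back later without touching the disk again.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b", torch_dtype=torch.float16
).to("cuda")

def park_in_ram(model):
    """Move all weights to system RAM and release the cached VRAM."""
    model.to("cpu")
    torch.cuda.empty_cache()

def restore_to_vram(model):
    """Move the weights back onto the GPU for inference."""
    model.to("cuda")

park_in_ram(model)      # VRAM is freed for other applications
restore_to_vram(model)  # faster than reloading the weights from disk
```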

henk717 commented 1 year ago

This is what already happens when you don't put all layers on the GPU.
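For context, a hedged sketch of that layer-splitting idea in general terms, using Hugging Face's device_map support; this illustrates partial offloading broadly, not necessarily the exact mechanism KoboldAI uses internally, and the memory budgets shown are just example values:

```python
import torch
from transformers import AutoModelForCausalLM

# Cap how much of the model goes to the GPU; whatever exceeds the VRAM
# budget is placed in system RAM instead of on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "16GiB", "cpu": "48GiB"},  # example budgets for GPU 0 and RAM
)
```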