KoboldAI / KoboldAI-Client

For GGUF support, see KoboldCPP: https://github.com/LostRuins/koboldcpp
https://koboldai.com
GNU Affero General Public License v3.0

Hot Swapping Models between RAM and VRAM? #321

Closed Sayayaya closed 1 year ago

Sayayaya commented 1 year ago

Models like Pygmalion require a lot of VRAM, which is a much more expensive resource than RAM. My workstation has 64 GB of RAM and 24 GB of VRAM (RTX 4090).

I don't like unloading and reloading the model too often, to avoid wear on my SSDs, but the models also use a lot of VRAM, which can get in the way of other tasks I might be doing on my computer.

What I'd like is the ability to hot-swap models between RAM and VRAM on demand, so that when I'm not using Kobold for a while, I could move the model into RAM, freeing up my VRAM for other tasks. This could be exposed via an API endpoint or something similar.

I imagine loading the model back out of RAM, which I have an abundance of anyway, would be faster than loading it off a drive.

I'm unsure how practical this suggestion is, or how long moving a model from RAM into VRAM (or vice versa) would take. But if the savings over pulling it off disk are significant, I think it would be a worthwhile feature.
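For illustration, here is a minimal PyTorch sketch of what such an on-demand swap could look like. The model name and helper functions are hypothetical examples, not part of KoboldAI's actual API:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical sketch: load a model onto the GPU, park it in system RAM
# on demand, and bring it back later without touching the disk again.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b", torch_dtype=torch.float16
).to("cuda")

def park_in_ram(model):
    """Move all weights to system RAM and release the cached VRAM."""
    model.to("cpu")
    torch.cuda.empty_cache()

def restore_to_vram(model):
    """Move the weights back onto the GPU for inference."""
    model.to("cuda")

park_in_ram(model)      # VRAM is freed for other applications
restore_to_vram(model)  # faster than reloading the weights from disk
```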

henk717 commented 1 year ago

This is what already happens when you don't put all layers on the GPU.
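For context, a hedged sketch of that layer-splitting idea in general terms, using Hugging Face's device_map support; this illustrates partial offloading broadly, not necessarily the exact mechanism KoboldAI uses internally, and the memory budgets shown are just example values:

```python
import torch
from transformers import AutoModelForCausalLM

# Cap how much of the model goes to the GPU; whatever exceeds the VRAM
# budget is placed in system RAM instead of on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "16GiB", "cpu": "48GiB"},  # example budgets for GPU 0 and RAM
)
```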