LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Is there a way to "store" context to reuse later? #706

Open AllesMeins opened 6 months ago

AllesMeins commented 6 months ago

My current project involves two alternating prompts I need to send. The first one is basically a normal chat, but I want to use a second call to the AI to interpret the answer, something along the lines of:

First prompt: Write an answer to this conversation.
Second prompt: Decide whether this answer is hostile or not.

Of course, as soon as I send the second prompt I can't reuse the context from the first one, and I have to wait for BLAS to ingest the whole conversation all over again. So I was wondering whether it is possible to somehow "store and load" the context of one prompt, to get quicker response times when I come back to it.
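For concreteness, here is a minimal Python sketch of that workflow, assuming a koboldcpp instance on localhost:5001 exposing the KoboldAI-compatible /api/v1/generate endpoint (the port, payload fields, and response shape are assumptions about a typical setup):

```python
import requests

API_URL = "http://localhost:5001/api/v1/generate"  # assumed port

def generate(prompt: str, max_length: int = 200) -> str:
    """Send a prompt to the server and return the generated text."""
    resp = requests.post(API_URL, json={"prompt": prompt, "max_length": max_length})
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

conversation = "User: Hello, how are you today?\nBot:"

# First prompt: continue the conversation.
answer = generate(conversation)

# Second prompt: interpret the answer. Because this prompt shares no
# prefix with the chat prompt, the server's cached context is invalidated,
# and the next chat turn has to be ingested from scratch again.
verdict = generate(
    "Decide whether this answer is hostile or not:\n" + answer + "\nVerdict:"
)
```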

LostRuins commented 6 months ago

Unfortunately the only way to do that in koboldcpp would be to run two instances of the server on different ports.
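As a sketch of what that looks like (the launch commands and ports below are illustrative, not an official recipe), each prompt type is routed to its own instance, so each instance keeps its own context cache warm:

```python
import requests

# One instance per prompt type, each on its own port, e.g. launched with:
#   python koboldcpp.py --model model.gguf --port 5001
#   python koboldcpp.py --model model.gguf --port 5002
CHAT_URL = "http://localhost:5001/api/v1/generate"   # keeps the chat context warm
JUDGE_URL = "http://localhost:5002/api/v1/generate"  # keeps the classifier context warm

def generate(url: str, prompt: str) -> str:
    resp = requests.post(url, json={"prompt": prompt, "max_length": 200})
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

answer = generate(CHAT_URL, "User: Hello!\nBot:")
verdict = generate(JUDGE_URL, "Is this answer hostile?\n" + answer + "\nVerdict:")
```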

AllesMeins commented 6 months ago

But that would also mean double the RAM/VRAM requirement, right?

LostRuins commented 6 months ago

Maybe. With mmap on, it's possible some of the memory might be shared.
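The reason sharing is plausible: with mmap the read-only model weights are demand-paged from the GGUF file through the OS page cache, so two processes mapping the same file can share one physical copy of the weights, while each instance's writable buffers (notably the KV cache) remain separate. Without mmap, each instance allocates its own private copy of the weights, which matches the OOM report below. A toy Python illustration of the underlying OS behavior (model.gguf is a placeholder filename):

```python
import mmap

# Two independent read-only maps of the same file: the kernel backs both
# with the same page-cache pages, so physical RAM is not doubled.
with open("model.gguf", "rb") as f1, open("model.gguf", "rb") as f2:
    m1 = mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ)
    m2 = mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ)
    assert m1[:8] == m2[:8]  # both see the same bytes, one copy in RAM
    m1.close()
    m2.close()
```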

aleksusklim commented 6 months ago

Related: https://github.com/LostRuins/koboldcpp/issues/445

I can confirm that I was able to run the largest Mixtral at 64k context with CuBLAS and 0 offloaded layers; RAM consumption was close to 75 GB of my 128 GB.

When running two instances on different ports, the memory in use was more than 100 GB (I can't remember the exact value), and I could use either of them without any problems or slowdowns, exactly as the OP wants to.

However, if I check "Disable MMAP", the second instance cannot load because of OOM.