LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Feature Request: Expose llama.cpp --no-mmap option #37

Closed: TrajansRow closed this issue 1 year ago

TrajansRow commented 1 year ago

There was a performance regression in earlier versions of llama.cpp that I may be hitting with long-running interactions. It was recently fixed by the addition of a --no-mmap option, which forces the entire model to be loaded into RAM, and I would like to be able to use it with koboldcpp as well.

https://github.com/ggerganov/llama.cpp/pull/801
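For context, the two loading strategies being discussed differ in who owns the memory: with mmap, the model file is mapped into the process's address space and the OS pages it in lazily (and may evict pages under memory pressure), while a --no-mmap style option reads the whole file into allocated RAM up front. The C sketch below only illustrates that general difference; the file name is hypothetical and this is not koboldcpp's actual loading code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Illustrative only: two ways of getting model weights into memory. */
int main(void) {
    const char *path = "model.gguf";            /* hypothetical model file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    /* Option 1: mmap. The kernel pages data in on demand and can drop
       clean pages when memory is tight, so resident usage fluctuates. */
    void *mapped = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) { perror("mmap"); close(fd); return 1; }
    munmap(mapped, st.st_size);

    /* Option 2: read the whole file into malloc'd RAM up front, which is
       roughly what a --no-mmap option amounts to: everything stays resident. */
    char *buf = malloc(st.st_size);
    if (!buf) { perror("malloc"); close(fd); return 1; }
    if (read(fd, buf, st.st_size) != st.st_size) {
        perror("read"); free(buf); close(fd); return 1;
    }
    free(buf);

    close(fd);
    return 0;
}
```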

LostRuins commented 1 year ago

This should be fixed in the latest version, currently v1.3 at the time of writing. In fact, I have made mmap disabled by default. To enable it, use the flag --usemmap: https://github.com/LostRuins/koboldcpp/releases/latest

Edit: as of version 1.4, mmap is now the default; you can toggle it off with --nommap instead.
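As a quick usage note (the launcher invocation and model path here are assumptions, not taken from this thread), disabling mmap at launch would look something like `python koboldcpp.py --model mymodel.gguf --nommap`, with all other launch arguments unchanged.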

TrajansRow commented 1 year ago

Thanks for the update! I'm seeing much better memory utilization now, although not quite the same performance improvement I saw in llama.cpp (maybe a generation speedup on the order of 10 ms/token, running a 30B 4-bit LLaMA model on an M1 Max).

This is a great option to have in koboldcpp, but I don't think it should be enabled by default. For the majority of users, I don't expect the memory tradeoffs to provide meaningful benefit.

LostRuins commented 1 year ago

Yeah, it got quite a divided response; some people hate it, others love it. In the end it's a toggle, so everyone can just pick whichever option they prefer.