alexrozanski / LlamaChat

Chat with your favourite LLaMA models in a native macOS app
https://llamachat.app
MIT License
1.43k stars · 53 forks

Support configuring whether to load the entire model into memory or use mmap #4

Closed · xISSAx closed this issue 1 year ago

xISSAx commented 1 year ago

Greetings! I love the application and the UX!

I noticed llama.cpp running on my M1 was flushing the model out of memory during and after each generation, causing slower-than-expected outputs. This can be fixed by passing the `--mlock` argument, which massively boosts M1 performance by locking the model in memory.

However, LlamaChat currently has the same issue, and I believe it can be fixed by passing the equivalent of that `--mlock` option. In fact, I suggest leaving it on by default for a seamless beginner experience on M1 Macs.
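For context, here's a minimal sketch of what `--mlock` corresponds to at the library level. It assumes the `use_mmap`/`use_mlock` fields of `llama_context_params` and the `llama_init_from_file` entry point from the llama.cpp C API around the time of this issue; newer versions organize these parameters differently.

```cpp
// Sketch: enabling mlock when loading a model through the llama.cpp C API.
// Field and function names reflect the llama.cpp API of this era and are
// assumptions here, not LlamaChat's actual code.
#include "llama.h"

llama_context *load_locked_model(const char *model_path) {
    llama_context_params params = llama_context_default_params();
    params.use_mmap  = true;   // map the model file so pages are loaded on demand
    params.use_mlock = true;   // pin the mapped pages so macOS cannot evict them mid-generation
    return llama_init_from_file(model_path, params);
}
```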

Moreover, please also consider an advanced settings option that lets users change these parameters themselves.

alexrozanski commented 1 year ago

Thanks @xISSAx. You're right: LlamaChat currently always sets the mlock parameter to false, since mmap-based loading was touted as a big performance improvement over previous versions (which I think is true for large models).

I need to do some more investigation into this, but I was definitely thinking of adding a switch for it. Perhaps you're right that it should be enabled by default for a good FTUE, but configurable if people need to change it.
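A rough sketch of how such a switch could be wired up, defaulting to on but user-configurable. The setting name (`lock_model_in_memory`) is purely hypothetical, and the llama.cpp fields are the same assumptions as in the snippet above.

```cpp
// Sketch: a user-facing toggle that maps onto the llama.cpp mlock parameter.
#include "llama.h"

struct ModelLoadSettings {
    bool lock_model_in_memory = true;   // hypothetical setting; on by default for a good first-run experience
};

llama_context *load_model(const char *model_path, const ModelLoadSettings &settings) {
    llama_context_params params = llama_context_default_params();
    params.use_mmap  = true;                                // keep mmap-based loading
    params.use_mlock = settings.lock_model_in_memory;       // user-overridable pinning
    return llama_init_from_file(model_path, params);
}
```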

alexrozanski commented 1 year ago

Added in v1.2.0