LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Raise generation speed to match recent llama.cpp updates #54

Closed · intulint closed 1 year ago

intulint commented 1 year ago

I know this will get done sooner or later, but I just wanted to play with the model in a convenient interface, and without the speed boost my weak machine takes a very long time to generate. Maybe it's because the model isn't being kept in memory? I have 8 GB of RAM, but with a 4 GB model nothing shows up as loaded into RAM. Is there a way to force mlock?

By the way, the project would not build under Linux on my laptop, which has AVX but not AVX2. I had to edit the Makefile: I removed the AVX flag and the project then built, although in theory it shouldn't have. Yet at startup it still says AVX is enabled. Perhaps generation is slow because it reports AVX as enabled while it is actually disabled?
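For anyone hitting the same AVX/AVX2 mismatch, here is a hedged sketch of how one might check what the CPU actually supports and build accordingly. The grep commands are standard Linux; passing flags through CFLAGS/CXXFLAGS on the make command line is an assumption about how this repo's Makefile is structured, so editing the file by hand may still be necessary.

```sh
# Check which SIMD extensions this CPU really has (Linux).
# -w matches whole words, so "avx" will not also match "avx2".
grep -ow avx  /proc/cpuinfo | head -n 1   # prints "avx" if AVX is supported
grep -ow avx2 /proc/cpuinfo | head -n 1   # prints "avx2" only if AVX2 is supported

# Build with AVX enabled but AVX2 explicitly disabled. Overriding
# CFLAGS/CXXFLAGS like this is an assumption about the Makefile's
# conventions; if it hardcodes -mavx2 somewhere else, edit that line instead.
make clean
make CFLAGS="-O3 -mavx -mno-avx2" CXXFLAGS="-O3 -mavx -mno-avx2"
```

Note that a command-line variable like `CFLAGS=...` overrides any plain `CFLAGS +=` lines inside the Makefile, which is why this can work without editing the file, provided the Makefile doesn't use the `override` directive.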

LostRuins commented 1 year ago

I usually merge the newest changes from llama.cpp every time I make a new release, so you can keep an eye on that.

namshub1 commented 1 year ago

Performance without the `--mlock` option is still extremely bad. Vicuna-13B 4-bit in llama.cpp with `--mlock` answers almost instantly on an M1 Pro MacBook with 16 GB RAM. The same prompt in koboldcpp takes 1 to 2 minutes. `--nommap` from https://github.com/LostRuins/koboldcpp/issues/28 does not help either.
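For reference, a sketch of the side-by-side this comment describes. `--mlock` is llama.cpp's flag for pinning model weights in RAM so the OS cannot page them out between requests; `--nommap` is the koboldcpp option from issue #28 that disables memory-mapping. The model path and prompt are placeholders.

```sh
# llama.cpp: pin the model in RAM so pages are never evicted mid-generation
./main -m ./models/vicuna-13b-q4_0.bin --mlock -p "Hello"

# koboldcpp: disable mmap so the whole model is read into memory up front
# (positional model argument per the README of that era; flag per issue #28)
python koboldcpp.py ./models/vicuna-13b-q4_0.bin --nommap
```

Without mlock, mmap'd weights can be evicted under memory pressure and must be re-read from disk on the next request, which is consistent with the minutes-long latencies reported here.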