Mantella is a Skyrim and Fallout 4 mod which allows you to naturally speak to NPCs using Whisper (speech-to-text), LLMs (text generation), and xVASynth / XTTS (text-to-speech).
Unfortunately koboldcpp with CUDA crashes on my PC because my processor doesn't support AVX2, and the other BLAS backends are too slow. As an alternative I use llamafile, which works nicely, is very light, and performs very well on my 3060 with 12 GB. The only problem is that every time I start a conversation, in order for the LLM to generate a response, I have to briefly alt+tab out of the game and back in so that llamafile generates the response and triggers the speech loop. The conversation then works for multiple follow-up comments, but after I ask a new question I have to alt+tab again to trigger the LLM. I was wondering what could cause this and whether there is a way to overcome the problem.