Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Follow-up answers are slow #408

Closed: woheller69 closed this issue 2 months ago

woheller69 commented 4 months ago

I have a CPU-only setup, so my system is quite slow. I notice that llamafile is much slower than gpt4all for follow-up answers.

e.g. I ask (using Dolphin 2.7 Mixtral 8x7b with its lengthy system message): A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. How can they cross the river without anything being eaten?

First prompt evaluation with llamafile is about the same as for gpt4all (about 80s)

But when I reply to the answer, telling it that its answer is wrong, llamafile takes about as long for prompt processing as it did for the first answer (60s), while gpt4all answers almost immediately (6s).
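For context: llama.cpp-based servers (which llamafile wraps) can reuse the KV cache for the longest shared token prefix between the previous prompt and the new one, so a follow-up in the same conversation should only pay for the newly appended tokens. A toy sketch of that cost model (the function names here are illustrative, not llamafile's actual API):

```python
def common_prefix_len(prev_tokens, new_tokens):
    """Length of the shared token prefix between two tokenized prompts."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

def tokens_to_evaluate(prev_tokens, new_tokens, cache_enabled=True):
    """Tokens the model must (re)process to answer the new prompt.

    With a prefix cache, only tokens after the shared prefix are evaluated;
    without one, the whole prompt is reprocessed every turn.
    """
    if not cache_enabled:
        return len(new_tokens)
    return len(new_tokens) - common_prefix_len(prev_tokens, new_tokens)

# A follow-up that simply appends to the old conversation should be cheap:
prev = list(range(1000))       # stand-in for the tokenized first exchange
new = prev + list(range(50))   # follow-up appends 50 new tokens
assert tokens_to_evaluate(prev, new) == 50
assert tokens_to_evaluate(prev, new, cache_enabled=False) == 1050
```

If the follow-up instead takes as long as the first prompt, the cache is effectively getting no prefix hit, which is what the timings above suggest.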

d-z-m commented 3 months ago

Sounds like you might be overflowing the context window. What context size are you running llamafile with? How many tokens is your prompt? How many tokens is the first response?
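One way context overflow produces exactly this symptom: once the conversation no longer fits in the context window, the oldest tokens are dropped, so the new prompt no longer starts with the old one and a prefix-matching cache gets no hit. A toy illustration (the real eviction strategy in llama.cpp varies and may protect the system prompt; this just shows the cache-busting effect):

```python
def truncate_to_context(tokens, ctx_size, keep_prefix=0):
    """Drop the oldest tokens (after an optional protected prefix,
    e.g. a system prompt) so the conversation fits in ctx_size."""
    if len(tokens) <= ctx_size:
        return tokens
    overflow = len(tokens) - ctx_size
    return tokens[:keep_prefix] + tokens[keep_prefix + overflow:]

ctx = 2048
prev = list(range(2000))            # first exchange nearly fills the window
new = prev + list(range(200))       # follow-up pushes past it
fitted = truncate_to_context(new, ctx)
assert len(fitted) == ctx
# The truncated prompt no longer begins with the old prompt, so a
# prefix-reuse cache must re-evaluate (almost) everything:
assert fitted[:10] != prev[:10]
```

This is why the context size (`-c` in llamafile), the prompt length, and the first response length together determine whether the second turn is fast or slow.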