ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Longer and infinite output #71

Closed: leszekhanusz closed this issue 11 months ago

leszekhanusz commented 1 year ago

If we use `-n 1000000` to get a very long output (for a story, for example), it stops generating quite quickly, after around 30 lines, probably because of this line of code.

It would be nice if we could have longer outputs, and also the possibility of infinite output, stopping only on Ctrl-C. We could maybe specify that `-n 0` triggers this infinite-output mode. This issue is somewhat related to issue #23
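
For readers landing on this issue later, here is a minimal sketch of the kind of invocation being asked for, using the flags that come up later in this thread (it ended up being `-n -1` rather than the proposed `-n 0`; the model path and prompt below are placeholders):

```sh
# Run until interrupted: -n -1 removes the token limit and --ignore-eos
# keeps sampling past an end-of-sequence token. Stop with Ctrl-C.
./main -m models/7B/ggml-model-q4_0.bin \
  -p "Once upon a time" \
  -n -1 \
  --ignore-eos
```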

setzer22 commented 1 year ago

> there is no publicly known method. ChatGPT could be doing something like this, but it might as well be doing context swap + summarization for all we know, no?

Agreed there. It's very hard to know anything for sure and OpenAI isn't going to tell anyone.

This is just a suspicion on my end based on interactions with the tool. One thing that makes me think they're not using something similar to the swap strategy implemented here is that there's never a clear point in the conversation where a lag spike occurs, but I'm also guessing there are ways to trick users by hiding the latency. They also seem to pull off other kinds of magic, like effortlessly reading through several pages of text and starting to generate in less than a second, so maybe their trick is just having super fast inference :man_shrugging:

ggerganov commented 1 year ago

The lack of latency with ChatGPT can be explained by the high memory bandwidth of GPUs. On a GPU, the memory throughput is much higher than on a CPU, and since prompt evaluation is memory bound, I can easily see a high-end GPU being 10-100x faster than a CPU for large context swaps. I expect the CPU-GPU difference for single-token inference to be much smaller, though.
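
(As a rough back-of-envelope illustration, with assumed typical numbers rather than measurements: dual-channel DDR4 delivers on the order of 50 GB/s, while a high-end GPU has roughly 1000 GB/s of memory bandwidth, so a purely memory-bound pass over the weights would be about 1000 / 50 = 20x faster on the GPU, squarely inside the 10-100x range mentioned above.)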

setzer22 commented 1 year ago

I see, that makes a lot of sense. It also explains why this project seems to be competitive in tokens/s with people running the oobabooga webui on a GPU :thinking:

eshaanagarwal commented 1 year ago

Hi, I am facing this issue while doing CPU inference using the GPT4ALL-1.3groovy model. Can somebody please help with that?

jboero commented 5 months ago

I know this is closed, but I just wanted to leave my $0.02 experience in case others come along. I run a workstation with 566GB RAM and an Nvidia RTX 4070 Ti, usually using the server (./examples/server/). The 4070 Ti with 12GB is great for chat and responses with reasonable models and prompt context sizes. If using completion instead of chat, I regularly have good luck running CPU mode with a prompt context size of 400,000+ (~300GB RAM with derived 13Bx8 models) for writing long responses or book chapters. Sometimes it takes multiple completions (hitting Start or forcing it to continue the completion with hints), but it will write indefinitely up to the ctx size if you coax it a bit, like a choose-your-own-adventure.

Example:

<EPICSTORY...> and then Johnny finished the code and they lived happily ever after. <END>

This seems like a perfectly logical stopping point, so Llama may finish here even if it has more context left and -1 predictions specified. What you can do is either add to this and re-prompt the completion, or edit it and adjust the ending into something that clearly needs more explaining:

<EPICSTORY...> and then Johnny finished the code and they lived happily ever after until he found a bug and then

This will hint to the completion that it needs to keep going. It's great because you can sort of guide it or edit parts of the completion you would like adjusted.
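
For anyone who wants to reproduce that add-and-re-prompt loop programmatically, here is a minimal sketch against the server's /completion endpoint (the model path, context size, port, and n_predict value are placeholders; this assumes the stock ./server example with its default port):

```sh
# Start the server with a large prompt context (model path is a placeholder).
./server -m models/13B/ggml-model-q4_0.bin --ctx-size 8192

# Request a completion; append the generated text to the prompt and repeat
# to push the story past a "natural" ending.
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "<EPICSTORY...> and then Johnny finished the code and they lived happily ever after until he found a bug and then", "n_predict": 512}'
```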

sprappcom commented 5 months ago

@jboero can you provide an example llama model you are using that can have a 400k context size? I thought the generated context is limited by the model you use. The max I know of is 192k, which I never get to touch because it's way above my 8GB VRAM and 64GB RAM.

How do you get 400k tokens generated?

jboero commented 5 months ago

Sorry, I should have specified "prompt context", not model context, à la the --ctx-size arg. No, I haven't actually used a model with a 400k context.

> -c N, --ctx-size N: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. The size may differ in other models; for example, baichuan models were built with a context of 4096.
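
For completeness, a minimal sketch of what that looks like on the command line (the model path and prompt are placeholders; 2048 matches the LLaMA training context mentioned in the quote):

```sh
# Raise the prompt context from the default 512 to the model's 2048 training context.
./main -m models/7B/ggml-model-q4_0.bin -c 2048 -n -1 -p "Write a long story about"
```
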
MrMBag commented 1 month ago

I realize I'm quite late to this party... but I'm able to get an infinite response running llama.cpp on my Raspberry Pi 4. When I load it up, I use this command:

```sh
./main -m models/[your model] --color \
  --ctx-size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 1.5 \
  --repeat_penalty 1.1 \
  --ignore-eos
```
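
For anyone copying this, as far as I understand these flags: `-n -1` removes the cap on generated tokens and `--ignore-eos` stops the end-of-sequence token from ending generation (those two are what make the output effectively infinite), `--temp 1.5` and `--top_k 10000` make sampling very loose, `--repeat_penalty 1.1` discourages loops, `-ins` enables instruct mode, and `-b 256` sets the batch size for prompt processing.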

I'm going to let you all know that I've been playing around with AI for literally the past 2 weeks, so I barely know what I'm doing, but I'm still learning (in case you looked at that command and gagged, laughed, or asked yourself, "What in God's holy name is this moron doing?"). I'm kind of like the guy who can lift a car off of someone in an emergency, because in that moment I'm not thinking about all the reasons I can't lift a car. What I mean is, I think I got llama.cpp to work in the first place by brute force and ignorance, so I can't explain why it works; it just does for me.

So I hope that helps anyone who knows less than I do, or it opens doors for someone who knows more than I do.

matti commented 3 weeks ago

@MrMBag thank you, that really helped me understand.