ggerganov / llama.cpp

LLM inference in C/C++

Do not recreate context while LLama is writing #828

Closed: janekb04 closed this issue 5 months ago

janekb04 commented 1 year ago

Expected Behavior

Tokens are generated at a roughly constant rate, i.e. N tokens per second on a given machine.

Current Behavior

Sometimes the LLM takes much longer than usual to generate a token. It can be a 10x slowdown.

Environment and Context

Setup: MacBook Pro 14-inch (2021), 10-core Apple M1 Pro CPU, 16 GB RAM
OS: macOS Ventura 13.3 (22E252)

clang --version:

Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix

Steps to Reproduce

Run ./main -m ./models/ggml-vicuna-7b-4bit-rev1.bin -n 512 --color -f prompts/chat-with-vicuna.txt --seed 42 --mlock

The model will get stuck after "of": ...or visit one of▏ the city's many restaurants...

Failure Logs

main: seed = 42
llama_model_load: loading model from './models/ggml-vicuna-7b-4bit-rev1.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/ggml-vicuna-7b-4bit-rev1.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 512, n_keep = 0

 A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: Hello, Assistant.
### Assistant: Hello. How may I help you today?
### Human: Please tell me the largest city in Europe.
### Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia.
### Human: Write a description of it for tourists
### Assistant: Welcome to Moscow! Located in the heart of Europe, this vibrant city is full of history, culture, and things to do. Whether you're interested in art, architecture, or outdoor activities, there's something for everyone in Moscow. Start your visit at Red Square, home to some of the city's most famous landmarks, including St. Basil's Cathedral and the Kremlin. Take a stroll through the historic neighborhood of Kitai-Gorod, where you can find plenty of shops and restaurants. Visit the Tretyakov Gallery to see some of Russia's most famous artwork, or take a trip to the outskirts of the city to explore the beautiful parks and gardens. Don't forget to try some of Moscow's delicious local cuisine, including borscht (beet soup) and balalaika (a type of stringed instrument).
### Human: No, please write about Amsterdam as if you are a tourist guide
### Assistant: Welcome to Amsterdam, the vibrant capital of the Netherlands! Known for its iconic canals, bustling nightlife, and liberal culture, this city is a must-visit destination for any traveler. Start your visit at Dam Square, home to some of Amsterdam's most famous landmarks, including the Royal Palace, the National Monument, and the New Church. Take a stroll along the canals, which are lined with charming homes, cafes, and shops. Visit the Van Gogh Museum to see the largest collection of Vincent van Gogh's paintings and letters in the world. Or take a boat tour of the city's many canals and historical sites. Don't forget to try some of Amsterdam's famous street food, such as pancakes and waffles, or visit one of the city's many restaurants for a taste of the local cuisine. Amsterdam is also known for its lively nightlife, with trendy bars, clubs, and coffee shops galore. Don't be afraid to explore the city's Red Light District, a colorful and historic area that has been a part of Amsterdam since the Middle Ages. Overall, Amsterdam is a fascinating and unique destination that offers something for everyone.
### Human: No, please write about New York City as if
llama_print_timings:        load time = 22203.17 ms
llama_print_timings:      sample time =   372.37 ms /   512 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time = 14223.05 ms /   368 tokens (   38.65 ms per token)
llama_print_timings:        eval time = 43922.75 ms /   510 runs   (   86.12 ms per run)
llama_print_timings:       total time = 80063.80 ms

Video

(animated GIF showing the stall during generation)

ngxson commented 1 year ago

Seems like it's due to context swapping. The context limit of LLaMA is 2048 tokens; after that, a "context swap" is performed:

            // infinite text generation via context swapping
            // if we run out of context:
            // - take the n_keep first tokens from the original prompt (via n_past)
            // - take half of the last (n_ctx - n_keep) tokens and recompute the logits in a batch

https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L256
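
In simplified form, the swap does something like this (a self-contained sketch of the logic described in those comments, not the actual main.cpp code; n_keep and the token values are just for illustration):

    // Minimal illustration of the "context swap": keep the first n_keep tokens,
    // drop the oldest half of the rest, and re-evaluate the newest half as a
    // batch before the next token can be sampled.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n_ctx  = 512; // context size from the log above
        const int n_keep = 48;  // hypothetical number of prompt tokens to keep

        std::vector<int> cache(n_ctx);            // tokens currently in the KV cache
        for (int i = 0; i < n_ctx; ++i) cache[i] = i;

        // the context is full, so perform the swap
        const int n_left = n_ctx - n_keep;                              // tokens past the kept prefix
        std::vector<int> refeed(cache.end() - n_left / 2, cache.end()); // newest half of them

        std::printf("keep %d prompt tokens, re-evaluate %zu recent tokens in one batch\n",
                    n_keep, refeed.size());
        // In llama.cpp that re-evaluation is a prompt-sized eval call, which is
        // the multi-second pause seen in the video above.
        return 0;
    }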

janekb04 commented 1 year ago

@ngxson This makes sense. This is invisible in, say, ChatGPT because there the context recreation happens only after the model has finished writing, when it's the user's turn.

KASR commented 1 year ago

Would it make sense to track how full the context is in interactive mode, so that we could swap the context (or, in this case, clear part of it) while the user is typing the next question?

janekb04 commented 1 year ago

It could also work like ChatGPT. There, the context is recreated every time the user sends a message: the tokens in the message are counted, the maximum response length is reserved on top of that, and then as much history as still fits is prepended. I don't know how that would fare performance-wise, though, as context recreation seems rather expensive in llama.cpp.
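
Roughly, the per-message assembly could look like this (just a sketch of the idea; the helper is hypothetical, not an existing llama.cpp API):

    // Hypothetical helper: build the prompt for one turn by reserving room for
    // the reply and prepending as much recent history as still fits.
    #include <algorithm>
    #include <vector>

    std::vector<int> build_prompt(const std::vector<int> & history,
                                  const std::vector<int> & query,
                                  int n_ctx, int n_predict) {
        // token budget left for history after the new query and the reserved reply
        const int budget = n_ctx - (int) query.size() - n_predict;

        std::vector<int> prompt;
        if (budget > 0) {
            const int take = std::min(budget, (int) history.size());
            prompt.assign(history.end() - take, history.end()); // most recent history
        }
        prompt.insert(prompt.end(), query.begin(), query.end());
        return prompt; // evaluated from scratch every turn, hence the cost
    }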

Here, I think a better solution would be to recreate the context as soon as LLaMA stops typing. We would assume that the user's query plus LLaMA's response must be no longer than a certain limit.
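
Something along these lines (again just a sketch; maybe_swap_context and the numbers are made up):

    // Idea: right after the model finishes its reply, check whether the next
    // exchange could overflow the context and, if so, do the swap immediately,
    // while the user is still reading or typing.
    #include <cstdio>

    // hypothetical helper: returns true if a swap was performed
    bool maybe_swap_context(int n_past, int n_ctx, int reserve_tokens) {
        if (n_past + reserve_tokens <= n_ctx) {
            return false; // enough room left, nothing to do
        }
        // ...perform the same half-context re-evaluation as main.cpp does here...
        std::printf("swapping now: n_past = %d, need %d free tokens\n", n_past, reserve_tokens);
        return true;
    }

    int main() {
        const int n_ctx   = 512; // context size from the log above
        const int reserve = 256; // assumed budget for the next query + reply

        int n_past = 400;        // tokens in the KV cache after the model's reply
        maybe_swap_context(n_past, n_ctx, reserve); // runs during the user's turn
        return 0;
    }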

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.