Prerequisites

Please answer the following questions for yourself before submitting an issue.

[X] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Steps to Reproduce
Run the chat program:
$ examples/chat-13B.sh
main: seed = 1680315413
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 2048
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: type = 2
llama_model_load: ggml map size = 7759.83 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required = 9807.93 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from './models/13B/ggml-model-q4_0.bin'
llama_model_load: model size = 7759.39 MB / num tensors = 363
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling: temp = 0.700000, top_k = 40, top_p = 0.500000, repeat_last_n = 256, repeat_penalty = 1.176470
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
Text transcript of a never ending dialog, where User interacts with an AI assistant named ChatLLaMa.
ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer User’s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User and ChatLLaMa say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.
User: Hello, ChatLLaMa!
ChatLLaMa: Hello User! How may I help you today?
User: What time is it?
ChatLLaMa: It is 10:16.
User: What year is it?
ChatLLaMa: We are in 2023.
User: Please tell me the largest city in Europe.
ChatLLaMa: The largest city in Europe is Moscow, the capital of Russia.
User: What can you tell me about Moscow?
ChatLLaMa: Moscow, on the Moskva River in western Russia, is the nation’s cosmopolitan capital. In its historic core is the Kremlin, a complex that’s home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
User: What is a cat?
ChatLLaMa: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
User: How do I pass command line arguments to a Node.js program?
ChatLLaMa: The arguments are stored in process.argv.
argv[0] is the path to the Node.js executable.
argv[1] is the path to the script file.
argv[2] is the first argument passed to the script.
argv[3] is the second argument passed to the script and so on.
User: Name a color.
ChatLLaMa: Blue
User:
Type this as a prompt:
Please rewrite the following statement in a customer friendly and polite way:\
\
I have looked at dmesg on other nodes. Unfortunately, c1node3, c1node5, and c1node7 indicate DIMM memory errors detected and corrected during previous boots. Additionally, c1node4 has SQUASHFS errors, as if the filesystem image was incompletely downloaded during the boot. On broken hardware, or with incompletely downloaded boot image, errors are just expected.\
\
Are you sure that the BIOS settings regarding the CPU clocks and voltages, as well as memory clocks, are in the range supported by the manufacturer? I.e. that there is no overclocking or undervolting going on, and the power supplies are all good?\
\
This leaves us with nodes 2 and 8, which have softlockups without any tainting factors.\
\
Could you please try to figure out when this kind of disaster started, and whether there was any change to the hardware or software immediately before that?
And then, llama.cpp writes this line as if the User said it:
If we can pinpoint the exact moment it started happening, then maybe we could narrow down what changed around that time.
I expected the reply to start with 'ChatLLaMa:'. Maybe llama.cpp should add the correct tokens explicitly, so that there is no need for the AI to predict that it is its turn to speak?
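To make the suggestion concrete, here is a minimal sketch of what injecting the prefix could look like. It assumes only llama_tokenize() from llama.h; the helper name inject_output_prefix is made up for illustration, and ctx and embd_inp stand for the context and the pending-input token queue that examples/main/main.cpp already maintains. The prefix string itself ("ChatLLaMa:") would presumably come from a new command line option rather than being hard-coded.

#include "llama.h"

#include <string>
#include <vector>

// Tokenize the assistant prefix and queue it for evaluation, so that
// generation continues after "ChatLLaMa:" instead of the model having to
// predict that it is its own turn to speak.
static void inject_output_prefix(llama_context * ctx,
                                 std::vector<llama_token> & embd_inp,
                                 const std::string & prefix) {
    // A string of n bytes tokenizes to at most n tokens (no BOS requested),
    // so prefix.size() + 1 is a safe upper bound for the buffer.
    std::vector<llama_token> tokens(prefix.size() + 1);
    const int n = llama_tokenize(ctx, prefix.c_str(), tokens.data(),
                                 (int) tokens.size(), /*add_bos=*/false);
    if (n <= 0) {
        return; // nothing usable came out; leave the input queue untouched
    }
    tokens.resize(n);
    embd_inp.insert(embd_inp.end(), tokens.begin(), tokens.end());
}

If something like this were called right after the user's input is tokenized and appended (and the prefix echoed to the console so the transcript stays readable), every reply would be forced to start with 'ChatLLaMa:' and the model would only ever have to continue it.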