Closed: hafezmg48 closed this issue 2 weeks ago.
I noticed similar behaviour on an Apple Silicon Mac with Llama 3 Instruct. It looks like it is solved by adding the assistant header (see https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3):
./llama-batched -m ~/aitest/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -p "One of the important historic<|eot_id|><|start_header_id|>assistant<|end_header_id|>" -np 50 -ngl 99 --ctx-size 12800 -n 251
So it might be more related to the base Llama 3.1 model than to llama.cpp inference.
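For reference, the chat format in the linked Meta docs wraps a single turn roughly like this (a sketch of the documented template; the placeholder text is mine):

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{user prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```

Ending the prompt with the assistant header signals to an instruct-tuned model that it is now its turn to respond, which is what the command above does.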
This is expected behaviour for a base model with the sampling configuration used by the llama-batched example. You don't see it with llama-cli because it uses a higher temperature by default (0.8).
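If you want to verify the sampling explanation, raising the temperature for the batched run should reduce the repetition, for example (this assumes your build of llama-batched parses the common sampling flags such as --temp; if it doesn't, the sampling values are set in examples/batched/batched.cpp):

./llama-batched -m ./llama3.1-8B-F16.gguf -p "One of the important historic" -np 50 -ngl 99 --ctx-size 12800 -n 251 --temp 0.8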
What happened?
I am trying to run llama-batched. It worked fine for small text sizes and small numbers of batches, but with a large number of batches, after a certain number of correct tokens the sequences start to repeat themselves. In my example I passed -np 50 for 50 batches and -n 251 to generate 251 tokens, and gave the single shared prompt "One of the important historic" to all of them. The tested model is Llama 3.1 8B, running on a CUDA GPU.
I tried to be careful with the context size and set --ctx-size 12800 to make sure there is enough shared KV cache for all the batches, but it didn't help and many of the sequences are corrupted repetitions. For example:
or
Below I have also provided the full log of the run.
Name and Version
llama-batched -m ./llama3.1-8B-F16.gguf -p "One of the important historic" -np 50 -ngl 99 --ctx-size 12800 -n 251
What operating system are you seeing the problem on?
Linux
Relevant log output