Closed · oKatanaaa closed this issue 1 year ago
There is most definitely something wrong in the way the prompt is fed into the program. Changes in the batch size can affect the output. Examples:
$ ./main -s 1000 -m models/7B/ggml-model-q4_0.bin --top_k 1 -b 1 -p "This is a sample prompt that expects continuation:"
"I'm going to have you write about your favorite memory. You can use any format, but I want it in my office by 10am tomorrow." (or whatever time) "You may start now..." [end of text]

$ ./main -s 1000 -m models/7B/ggml-model-q4_0.bin --top_k 1 -b 100 -p "This is a sample prompt that expects continuation:"
"The first time I saw you, it was love at first sight. You were so beautiful and charming." This one has an expectation of completion too! It's just the beginning... [end of text]
I was experimenting with changing the word selection by forcing the most likely output every time using --top_k 1 to eliminate next word sampling.
My guess is that batch size = 1 will give the more "correct" behaviour of the model.
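(To spell out what --top_k 1 does: it effectively turns sampling into greedy decoding. A minimal sketch of the idea in C++, not llama.cpp's actual sampler code, with the function name being my own:)

#include <algorithm>
#include <iterator>
#include <vector>

// Greedy decoding: with top_k = 1 the "sampling" collapses to always
// picking the single most likely next token, which removes next-word
// sampling as a source of output differences between runs.
size_t pick_next_token(const std::vector<float> &logits) {
    return std::distance(logits.begin(),
                         std::max_element(logits.begin(), logits.end()));
}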
@alankila I've noticed similar behavior: batch_size=1 seemed to be more correct. But I was just eyeballing the results; the only definitive way would be to test it like this: https://github.com/ggerganov/llama.cpp/pull/270
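(For anyone who wants to run that kind of test: the idea behind the perplexity measurement is roughly the following. This is only a sketch of the metric, assuming we already have the model's per-token log-probabilities for a reference text; it is not the code from the PR above.)

#include <cmath>
#include <vector>

// Perplexity over a reference text: exp of the average negative
// log-likelihood of each actual next token. Lower is better, which
// gives an objective way to compare two batch sizes.
// (log_probs[i] is assumed to be log p(token_i | tokens before i).)
double perplexity(const std::vector<double> &log_probs) {
    double nll = 0.0;
    for (double lp : log_probs) nll -= lp;
    return std::exp(nll / (double) log_probs.size());
}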
Hm, yes, I agree. However, I have an interactive assistant working with a prompt, and if I use a batch size of 100 or whatever, so that the entire prompt and all my conversation turns are eaten at once, the model is continuously rather confused and keeps making mistakes in reading what I write. I think it is pretty obvious that higher batch sizes do not work correctly at present.
I also think the defaults are not too good, which is somewhat of an issue for this less scientific, unquantified approach. To be honest, I can't get the regular --top_p 0.9 to stay on topic, and repeat_last_n must be lowered considerably for chat mode, or the AI generates the end-of-chat token instead of replying a lot of the time. My guess is that the likelihood of generating the "Bob:" tokens is lowered too much, and the model tends to choose to end the discussion instead.
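(My understanding of why that happens, sketched as rough C++ rather than the exact llama.cpp code, and with names of my own: the repeat penalty pushes down the logits of every token that appeared in the last repeat_last_n positions, which includes the tokens that spell "Bob:".)

#include <unordered_set>
#include <vector>

// Rough sketch of the idea (not the exact llama.cpp code): every token
// seen in the last repeat_last_n positions gets its logit pushed down,
// so tokens the assistant keeps reusing, like the ones spelling "Bob:",
// can become less likely than simply ending the chat.
void apply_repeat_penalty(std::vector<float> &logits,
                          const std::vector<int> &last_tokens, // last repeat_last_n token ids
                          float penalty) {                     // e.g. 1.2
    std::unordered_set<int> recent(last_tokens.begin(), last_tokens.end());
    for (int tok : recent) {
        if (logits[tok] > 0.0f) logits[tok] /= penalty;
        else                    logits[tok] *= penalty;
    }
}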
I am currently using the following parameters with the Bob-like assistant string:
./main -m ./models/7B/ggml-model-q4_0.bin -b 1 --ctx_size 2048 --temp 1.0 --top_k 100 --top_p 0.7 --repeat_last_n 20 --repeat_penalty 1.2 -n 2048 --color -i -r "User:" -p "Transcript of dialog between blah blah blah"
At least this way, with a lower top_p value, the model is usually quite coherent and I can have long chats with it. After enjoying the fairly coherent chat afforded by this model, it is very obvious now that increasing the batch size makes the model barely understand what I am saying to it.
I'm not sure I understand enough of the code to draw this conclusion, but I think the whole deal with batching is sacrificing quality to get speed. That mainly applies to optimizing by crossing the CPU<->GPU boundary fewer times, but for CPU inference I'm not even sure batching is significantly faster; it hasn't felt that way in my somewhat unscientific tests.
OTOH, to compute the attention mechanism for a token, you need the data for the previous tokens to be there, because every token in the input must attend to every previous token. So if you batch them in groups of N, I can imagine there being a downgrade in quality, unless someone has made sure that the matrix multiplications are done in such a way that the data for tokens 0..N-1 is available when computing the data for token N.
That's incorrect and it shouldn't sacrifice anything. It should also be faster on CPU: all the PyTorch transformers I had to run on CPU were significantly faster at reading prompts than at generating text. The transformer architecture allows computing the activations of a single layer for a whole batch in one go. Under the hood there are actually three steps:

1. Compute the Query, Key and Value matrices for all tokens in the batch; the new Key and Value vectors are appended to the hidden state kept for the previous positions.
2. Each Query vector at position N attends to the Key vectors at positions N, N-1, N-2, and so on, i.e. to the previously stored ones plus the ones just added from the batch.
3. The attention result for each token goes through the rest of the layer (the feed-forward block) to produce that layer's output for the whole batch.

Each token's destiny depends only on QKV matrices of itself and tokens that are placed before it. This is done for each layer, one by one: a batch goes in, a batch comes out.
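(To make step 2 concrete, here is a rough single-head sketch, my own illustration rather than llama.cpp's ggml code. The K/V rows for the whole batch have already been appended to the cache in step 1, and each query at position pos only reads rows 0..pos, so in exact arithmetic the result does not depend on how the prompt was batched.)

#include <algorithm>
#include <cmath>
#include <vector>

// One head, one layer, causal attention against a KV cache.
// kv.K and kv.V hold one row per position processed so far, including
// the rows just appended for the current batch in step 1.
struct KV { std::vector<std::vector<float>> K, V; };

std::vector<float> attend(const std::vector<float> &q, size_t pos, const KV &kv) {
    const size_t dim = q.size();
    std::vector<float> score(pos + 1);
    float max_s = -1e30f;
    for (size_t j = 0; j <= pos; ++j) {                    // causal: only rows 0..pos
        float s = 0.0f;
        for (size_t d = 0; d < dim; ++d) s += q[d] * kv.K[j][d];
        score[j] = s / std::sqrt((float) dim);
        max_s = std::max(max_s, score[j]);
    }
    float sum = 0.0f;                                      // softmax over the scores
    for (size_t j = 0; j <= pos; ++j) { score[j] = std::exp(score[j] - max_s); sum += score[j]; }
    std::vector<float> out(dim, 0.0f);                     // weighted sum of V rows
    for (size_t j = 0; j <= pos; ++j)
        for (size_t d = 0; d < dim; ++d) out[d] += (score[j] / sum) * kv.V[j][d];
    return out;
}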
A huge part of why transformers overtook RNNs is this property that allows training on whole data chunks in one pass.
@jarcen Ok, yeah, this makes sense in general terms and you seem to know more about it than me. Sorry I added noise to the discussion.
One question though, because I'd like to make sure I understood your point correctly:
"Each token's destiny depends only on QKV matrices of itself and tokens that are placed before it."

When batching inputs, the "tokens that are placed before it" are part of the batch and are being computed at the same time, isn't that correct? Or do you mean the data dependency only goes to previous tokens in previous layers?
And leaving that aside, I (and others) have clearly observed output quality differences when varying the batch size. So if this is not an issue theoretically speaking, then it may be a bug in the implementation?
They are not being computed at the same time. Computations in one layer are separated into the three steps I listed above. Step 2 operates on Query-Key-Value matrices which were already created in step 1. The Key-Value matrices are no longer part of the batch but part of the hidden state. Each Query vector at position N looks for Key vectors at positions N, N-1, N-2, N-3, etc. That includes the K vectors that already existed and the ones that were just added from the batch in step 1. If there are four Q vectors, then four threads can do that in parallel; there is no data dependency between those threads.
(Note that I'm seemingly using Vector and Matrix interchangeably but matrices are essentially how batching is implemented: each row is a vector. So, individual per-token operations are explained with vectors.)
Example code expressing the idea of self-attention with string operations:

// For a "query" at `position`, collect every index at or before it whose
// character matches, scanning from `position` back down to the start.
List<int> FindAllOccurencesBefore(string str, char symbolToFind, int position) {
    List<int> found = new List<int>();
    while (position >= 0) {   // >= 0 so that position 0 is also checked
        if (str[position] == symbolToFind)
            found.Add(position);
        position--;
    }
    return found;
}
This code can be run in parallel in multiple threads. One thread might start at position 5, another at 6, 7, 8 and so on. They do not conflict in any way. That is essentially what happens at step 2, except the characters are Key vectors and symbolToFind is a Query vector. The string has already been updated at step 1, with new elements from the batch appended to the tail.
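(The same independence, shown with actual threads; C++ this time, since that is what llama.cpp itself is written in, and all names here are mine:)

#include <string>
#include <thread>
#include <vector>

// Each "query" position only scans backwards over shared read-only
// data and writes into its own slot, so the per-position searches can
// run on separate threads with no synchronization at all.
int main() {
    const std::string str = "abcabca";
    const char symbolToFind = 'a';
    std::vector<std::vector<int>> found(str.size());
    std::vector<std::thread> workers;
    for (int pos = 0; pos < (int) str.size(); ++pos) {
        workers.emplace_back([&, pos] {
            for (int j = pos; j >= 0; --j)
                if (str[j] == symbolToFind) found[pos].push_back(j);
        });
    }
    for (auto &t : workers) t.join();
    return 0;
}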
Now for the quality: yes, it must be a bug somewhere. I have read llama_eval multiple times and can't find any error. I think it hides somewhere in the ggml operators, but reading that vectorized code is practically impossible.
Can you guys give a test with the latest master? I believe the results should now be the same for different batch sizes.
The latest master still generates different outputs (using the same seed and prompt, but a different batch size).
I also have this problem. Is this a limitation of llama.cpp? Why is this thread closed?
I was tinkering with the code and made the following change on line 977 of main.cpp (as it seemed wrong to me): from ... to ...
The model's (13B) outputs suddenly changed. I reverted the change and tried playing with the batch_size parameter, and it really does affect the output. Not sure if it's expected behaviour; as far as I understand, it shouldn't be the case. A bug? Do different batch sizes have different evaluation results (rounding error)?
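(On the rounding question: small differences of that kind would not be surprising on their own, since grouping the same floating-point operations differently can change the result slightly. A toy C++ example of mine:)

#include <algorithm>
#include <cstdio>
#include <vector>

// Summing the same numbers with different groupings (think: different
// batch sizes) gives slightly different float totals, because
// floating-point addition is not associative.
float sum_in_batches(const std::vector<float> &x, size_t batch) {
    float total = 0.0f;
    for (size_t i = 0; i < x.size(); i += batch) {
        float partial = 0.0f;
        for (size_t j = i; j < std::min(i + batch, x.size()); ++j)
            partial += x[j];
        total += partial;
    }
    return total;
}

int main() {
    std::vector<float> x(10000, 0.1f);
    std::printf("batch   1: %.6f\n", sum_in_batches(x, 1));
    std::printf("batch 100: %.6f\n", sum_in_batches(x, 100));
    return 0;
}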