chong000 opened this issue 2 weeks ago
Please include specific instructions for reproducing the difference. How do you know the accuracy has decreased? What if our accuracy is better? What weights are you using? What quant are you using?
llamafile-0.8.6 vs llama.cpp b2249. I ran into this accuracy issue while migrating yuan2.0-2b (https://huggingface.co/IEITYuan/Yuan2-2B-Februa-hf/tree/main) to llamafile; I also tested chinese-alpaca-2-1.3b-f16.gguf (https://huggingface.co/hfl/chinese-alpaca-2-1.3b-gguf/tree/main). The results are as follows:
llamafile:

Modified common.cpp:2383 to fix the input:

input: std::vector

```
(gdb) run -m /mnt/md0/sc/ckpts/gguf/chinese-alpaca-2-1.3b-f16.gguf -c 4096 -b 4096 -t 1 -n 100 --precise -p '北京简介'

breakpoint at llama.cpp:11184:
ggml_backend_tensor_get_async(backend_res, res, logits_out, 0, n_outputs_new*n_vocab*sizeof(float));

(gdb) p logits_out[0]
$1 = 0.547848821
(gdb) p logits_out[1]
$2 = 0.231845081
(gdb) p logits_out[2]
$3 = 11.2775717
(gdb) p logits_out[3]
$4 = 2.33590698
(gdb) p logits_out[4]
$5 = -1.11707854
(gdb) p logits_out[5]
$6 = -1.94668245
```
llama.cpp:

Modified common.cpp:1360 the same way, same input:

input: std::vector

```
(gdb) run -m /mnt/md0/sc/ckpts/gguf/chinese-alpaca-2-1.3b-f16.gguf -c 4096 -b 4096 -t 1 -n 100 -p '北京简介'

breakpoint at llama.cpp:8015:
ggml_backend_tensor_get_async(res_backend, res, logits_out.data(), (n_vocab*(n_tokens - 1))*sizeof(float), n_vocab*sizeof(float));

(gdb) p logits_out.data()[0]
$2 = 0.544902027
(gdb) p logits_out.data()[1]
$3 = 0.22588858
(gdb) p logits_out.data()[2]
$4 = 11.2715759
(gdb) p logits_out.data()[3]
$5 = 2.33415461
(gdb) p logits_out.data()[4]
$6 = -1.11890757
(gdb) p logits_out.data()[5]
$7 = -1.94703293
```
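For reference, a minimal sketch that quantifies the divergence, using the six logit values copied from the two GDB sessions above (not from a re-run):

```cpp
#include <cmath>
#include <cstdio>

// First six logits from the two GDB sessions above (chinese-alpaca-2-1.3b-f16.gguf).
int main() {
    const float llamafile_logits[6] = {0.547848821f, 0.231845081f, 11.2775717f,
                                       2.33590698f, -1.11707854f, -1.94668245f};
    const float llamacpp_logits[6]  = {0.544902027f, 0.22588858f, 11.2715759f,
                                       2.33415461f, -1.11890757f, -1.94703293f};
    float max_abs_diff = 0.0f;
    for (int i = 0; i < 6; ++i) {
        float d = std::fabs(llamafile_logits[i] - llamacpp_logits[i]);
        if (d > max_abs_diff) max_abs_diff = d;
        std::printf("logit[%d]: llamafile=%g llama.cpp=%g diff=%g\n",
                    i, llamafile_logits[i], llamacpp_logits[i], d);
    }
    std::printf("max abs diff over first 6 logits: %g\n", max_abs_diff);
}
```

For these six values the largest absolute difference is about 6e-3, on a 4-layer model.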
With the same input and the same GGUF file, the output distributions are not completely consistent. Comparing the results operator by operator shows that GGML_OP_MUL_MAT is where the calculation differences arise, and a model with more layers accumulates a larger error in the final logit distribution: chinese-alpaca-2-1.3b has only 4 layers, while yuan2.0-2b has 24 layers.
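To illustrate why a different MUL_MAT kernel can produce slightly different values even with identical weights, here is a small self-contained sketch (not llamafile or llama.cpp code): merely changing the accumulation order of a float dot product, as a SIMD or blocked kernel does, changes the rounded result.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// Dot product accumulated left-to-right in a single float accumulator.
float dot_serial(const std::vector<float>& a, const std::vector<float>& b) {
    float acc = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
    return acc;
}

// Same dot product, accumulated in 4 partial sums (as a vectorized kernel would).
float dot_blocked(const std::vector<float>& a, const std::vector<float>& b) {
    float acc[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < a.size(); ++i) acc[i % 4] += a[i] * b[i];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> a(4096), b(4096);  // hidden-size-like vector length
    for (auto& x : a) x = dist(rng);
    for (auto& x : b) x = dist(rng);
    std::printf("serial : %.9g\n", dot_serial(a, b));
    std::printf("blocked: %.9g\n", dot_blocked(a, b));  // typically differs in the last bits
}
```

Per-element differences like this are tiny, but they compound layer by layer, which matches the observation that the 24-layer yuan2.0-2b drifts more than the 4-layer chinese-alpaca model.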
Background: using the same GGUF model with the same parameters and inputs, with --top-k 1 (greedy sampling), llamafile-0.8.6 and llama.cpp b2249 already produce different logit distributions when generating the first token:

llama.cpp:

```
(gdb) p logits_out.data()[0]
$1 = -7.85015535
(gdb) p logits_out.data()[1]
$2 = -3.79276466
(gdb) p logits_out.data()[2]
$3 = -9.46714878
(gdb) p logits_out.data()[3]
$4 = -9.61338234
(gdb) p logits_out.data()[4]
$5 = -7.74912691
```

llamafile:

```
(gdb) p logits_out[0]
$4 = -8.0756588
(gdb) p logits_out[1]
$5 = -3.83499479
(gdb) p logits_out[2]
$6 = -9.46789455
(gdb) p logits_out[3]
$7 = -9.51721096
(gdb) p logits_out[4]
$8 = -7.68155956
```
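Since --top-k 1 simply takes the argmax of the logits, the generations only diverge once two candidate logits are closer together than the numerical drift. A minimal sketch of that selection step with hypothetical logit values (generic argmax, not the actual llama.cpp sampler code):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Greedy (top-k = 1) selection: pick the token id with the largest logit.
static size_t greedy_pick(const std::vector<float>& logits) {
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i)
        if (logits[i] > logits[best]) best = i;
    return best;
}

int main() {
    // Hypothetical logits where the top two candidates are ~2e-3 apart,
    // i.e. within the drift observed between llamafile and llama.cpp above.
    std::vector<float> run_a = {-7.850f, -3.7927f, -3.7910f, -9.613f};
    std::vector<float> run_b = {-7.850f, -3.7905f, -3.7910f, -9.613f};  // same logits with small drift
    std::printf("run A picks token %zu, run B picks token %zu\n",
                greedy_pick(run_a), greedy_pick(run_b));  // different tokens -> outputs diverge
}
```

Once the first token differs, everything generated after it diverges as well, so even sub-percent logit differences can lead to visibly different completions.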
llamafile improves prompt processing speed, but compared to llama.cpp the generation accuracy decreases. Is this normal?