chong000 opened this issue 2 weeks ago
Please include specific instructions for reproducing the difference. How do you know the accuracy has decreased? What if our accuracy is better? What weights are you using? What quant are you using?
llamafile-0.8.6 vs llama.cpp b2249. I ran into this accuracy issue while migrating yuan2.0-2b (https://huggingface.co/IEITYuan/Yuan2-2B-Februa-hf/tree/main) to llamafile; I also tested chinese-alpaca-2-1.3b-f16.gguf (https://huggingface.co/hfl/chinese-alpaca-2-1.3b-gguf/tree/main). The results are as follows:
llamafile:

Modified common.cpp:2383 to fix the input:

input: std::vector

```
(gdb) run -m /mnt/md0/sc/ckpts/gguf/chinese-alpaca-2-1.3b-f16.gguf -c 4096 -b 4096 -t 1 -n 100 --precise -p '北京简介'

breakpoint at llama.cpp:11184:
ggml_backend_tensor_get_async(backend_res, res, logits_out, 0, n_outputs_new*n_vocab*sizeof(float));

(gdb) p logits_out[0]
$1 = 0.547848821
(gdb) p logits_out[1]
$2 = 0.231845081
(gdb) p logits_out[2]
$3 = 11.2775717
(gdb) p logits_out[3]
$4 = 2.33590698
(gdb) p logits_out[4]
$5 = -1.11707854
(gdb) p logits_out[5]
$6 = -1.94668245
```
llama.cpp:

Modified common.cpp:1360 the same way, same input:

input: std::vector

```
(gdb) run -m /mnt/md0/sc/ckpts/gguf/chinese-alpaca-2-1.3b-f16.gguf -c 4096 -b 4096 -t 1 -n 100 -p '北京简介'

breakpoint at llama.cpp:8015:
ggml_backend_tensor_get_async(res_backend, res, logits_out.data(), (n_vocab*(n_tokens - 1))*sizeof(float), n_vocab*sizeof(float));

(gdb) p logits_out.data()[0]
$2 = 0.544902027
(gdb) p logits_out.data()[1]
$3 = 0.22588858
(gdb) p logits_out.data()[2]
$4 = 11.2715759
(gdb) p logits_out.data()[3]
$5 = 2.33415461
(gdb) p logits_out.data()[4]
$6 = -1.11890757
(gdb) p logits_out.data()[5]
$7 = -1.94703293
```
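For reference, a minimal sketch that quantifies the divergence, using the six logit values copied from the two GDB sessions above (not from a re-run):

```cpp
#include <cmath>
#include <cstdio>

// First six logits from the two GDB sessions above (chinese-alpaca-2-1.3b-f16.gguf).
int main() {
    const float llamafile_logits[6] = {0.547848821f, 0.231845081f, 11.2775717f,
                                       2.33590698f, -1.11707854f, -1.94668245f};
    const float llamacpp_logits[6]  = {0.544902027f, 0.22588858f, 11.2715759f,
                                       2.33415461f, -1.11890757f, -1.94703293f};
    float max_abs_diff = 0.0f;
    for (int i = 0; i < 6; ++i) {
        float d = std::fabs(llamafile_logits[i] - llamacpp_logits[i]);
        if (d > max_abs_diff) max_abs_diff = d;
        std::printf("logit[%d]: llamafile=%g llama.cpp=%g diff=%g\n",
                    i, llamafile_logits[i], llamacpp_logits[i], d);
    }
    std::printf("max abs diff over first 6 logits: %g\n", max_abs_diff);
}
```

For these six values the largest absolute difference is about 6e-3, on a 4-layer model.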
With the same input and the same GGUF file, the output distributions are not completely consistent. Comparing the results operator by operator shows that GGML_OP_MUL_MAT is where the calculation differences arise, and a model with more layers accumulates a larger error in the final logit distribution: chinese-alpaca-2-1.3b has only 4 layers, while yuan2.0-2b has 24 layers.
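To illustrate why a different MUL_MAT kernel can produce slightly different values even with identical weights, here is a small self-contained sketch (not llamafile or llama.cpp code): merely changing the accumulation order of a float dot product, as a SIMD or blocked kernel does, changes the rounded result.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// Dot product accumulated left-to-right in a single float accumulator.
float dot_serial(const std::vector<float>& a, const std::vector<float>& b) {
    float acc = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
    return acc;
}

// Same dot product, accumulated in 4 partial sums (as a vectorized kernel would).
float dot_blocked(const std::vector<float>& a, const std::vector<float>& b) {
    float acc[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < a.size(); ++i) acc[i % 4] += a[i] * b[i];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> a(4096), b(4096);  // hidden-size-like vector length
    for (auto& x : a) x = dist(rng);
    for (auto& x : b) x = dist(rng);
    std::printf("serial : %.9g\n", dot_serial(a, b));
    std::printf("blocked: %.9g\n", dot_blocked(a, b));  // typically differs in the last bits
}
```

Per-element differences like this are tiny, but they compound layer by layer, which matches the observation that the 24-layer yuan2.0-2b drifts more than the 4-layer chinese-alpaca model.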
Background: using the same GGUF model with the same parameters and inputs, with --top-k 1 (greedy sampling), llamafile-0.8.6 and llama.cpp b2249 already produce different logit distributions when generating the first token:

llama.cpp:

```
(gdb) p logits_out.data()[0]
$1 = -7.85015535
(gdb) p logits_out.data()[1]
$2 = -3.79276466
(gdb) p logits_out.data()[2]
$3 = -9.46714878
(gdb) p logits_out.data()[3]
$4 = -9.61338234
(gdb) p logits_out.data()[4]
$5 = -7.74912691
```

llamafile:

```
(gdb) p logits_out[0]
$4 = -8.0756588
(gdb) p logits_out[1]
$5 = -3.83499479
(gdb) p logits_out[2]
$6 = -9.46789455
(gdb) p logits_out[3]
$7 = -9.51721096
(gdb) p logits_out[4]
$8 = -7.68155956
```
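Since --top-k 1 simply takes the argmax of the logits, the generations only diverge once two candidate logits are closer together than the numerical drift. A minimal sketch of that selection step with hypothetical logit values (generic argmax, not the actual llama.cpp sampler code):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Greedy (top-k = 1) selection: pick the token id with the largest logit.
static size_t greedy_pick(const std::vector<float>& logits) {
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i)
        if (logits[i] > logits[best]) best = i;
    return best;
}

int main() {
    // Hypothetical logits where the top two candidates are ~2e-3 apart,
    // i.e. within the drift observed between llamafile and llama.cpp above.
    std::vector<float> run_a = {-7.850f, -3.7927f, -3.7910f, -9.613f};
    std::vector<float> run_b = {-7.850f, -3.7905f, -3.7910f, -9.613f};  // same logits with small drift
    std::printf("run A picks token %zu, run B picks token %zu\n",
                greedy_pick(run_a), greedy_pick(run_b));  // different tokens -> outputs diverge
}
```

Once the first token differs, everything generated after it diverges as well, so even sub-percent logit differences can lead to visibly different completions.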
llamafile improves prompt processing speed, but compared to llama.cpp the generation accuracy decreases. Is this normal?