Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

llamafile vs llama.cpp: model generation results differ #471

Open chong000 opened 2 weeks ago

chong000 commented 2 weeks ago

Background: Using the same GGUF model with the same parameters and inputs, with --top-k 1 (greedy sampling), comparing llamafile-0.8.6 against llama.cpp b2249: when generating the first token, the logits produced by llama.cpp and llamafile already differ.

llama.cpp:

```
(gdb) p logits_out.data()[0]
$1 = -7.85015535
(gdb) p logits_out.data()[1]
$2 = -3.79276466
(gdb) p logits_out.data()[2]
$3 = -9.46714878
(gdb) p logits_out.data()[3]
$4 = -9.61338234
(gdb) p logits_out.data()[4]
$5 = -7.74912691
```

llamafile:

```
(gdb) p logits_out[0]
$4 = -8.0756588
(gdb) p logits_out[1]
$5 = -3.83499479
(gdb) p logits_out[2]
$6 = -9.46789455
(gdb) p logits_out[3]
$7 = -9.51721096
(gdb) p logits_out[4]
$8 = -7.68155956
```

llamafile improves prompt processing speed, but compared to llama.cpp the generation accuracy decreases. Is this normal?
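For reproducing the first-token comparison without gdb, here is a minimal sketch that loads the shared GGUF file through the llama.cpp C API, decodes a single BOS token, and prints the leading logits; the same program can be compiled against a llama.cpp build and a llamafile build and the outputs diffed. The calls used (llama_load_model_from_file, llama_new_context_with_model, llama_batch_get_one, llama_get_logits) existed in the versions discussed here, but exact signatures and defaults shift between releases, so treat this as an assumption-laden outline rather than the exact procedure used above.

```c++
// Sketch: feed a single BOS token and print the first few logits, so the
// same dump can be produced from a llama.cpp build and a llamafile build
// and compared directly. Error handling is omitted; argv[1] is assumed to
// be the path to the shared .gguf file.
#include <cstdio>
#include "llama.h"

int main(int argc, char **argv) {
    const char *model_path = argv[1];

    llama_backend_init();                       // older releases take a bool numa argument
    llama_model_params mparams = llama_model_default_params();
    llama_model *model = llama_load_model_from_file(model_path, mparams);

    llama_context_params cparams = llama_context_default_params();
    llama_context *ctx = llama_new_context_with_model(model, cparams);

    // Decode one BOS token; its logits are the "first token" distribution
    // being compared in this issue.
    llama_token bos = llama_token_bos(model);
    llama_batch batch = llama_batch_get_one(&bos, 1, 0, 0);
    llama_decode(ctx, batch);

    const float *logits = llama_get_logits(ctx);
    const int n_vocab = llama_n_vocab(model);
    for (int i = 0; i < 5 && i < n_vocab; i++) {
        printf("logits[%d] = %.8f\n", i, logits[i]);
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

Since the dump happens before any sampling, any difference it shows between the two builds is a compute-path difference, independent of the --top-k 1 setting.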

jart commented 2 weeks ago

Compare with:

https://github.com/Mozilla-Ocho/llamafile/blob/c38feb4f4896216458b77665aca532897476c040/llama.cpp/README.llamafile#L11-L13

Please include specific instructions for reproducing the difference. How do you know the accuracy has decreased? What if our accuracy is better? What weights are you using? What quant are you using?

chong000 commented 2 weeks ago

llamafile-0.8.6 vs llama.cpp b2249: I hit this accuracy issue while migrating yuan2.0-2b (https://huggingface.co/IEITYuan/Yuan2-2B-Februa-hf/tree/main) to llamafile. I also tested chinese-alpaca-2-1.3b-f16.gguf (https://huggingface.co/hfl/chinese-alpaca-2-1.3b-gguf/tree/main). The results are as follows:

With the same input and the same GGUF file, the output distributions are not fully consistent. Comparing the results operator by operator, I found that GGML_OP_MUL_MAT introduces the numerical differences. A model with more layers accumulates a larger error in the final logit distribution: chinese-alpaca-2-1.3b has only 4 layers, while yuan2.0-2b has 24 layers.
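To make the operator-by-operator comparison concrete, here is a rough sketch of the kind of element-wise check that would flag a GGML_OP_MUL_MAT output mismatch. It assumes the per-op outputs from the two runtimes have already been captured into float buffers by some dumping mechanism; that mechanism is not shown and is not part of llama.cpp's public API, so the helper below is purely illustrative.

```c++
// Sketch: given the same operator's output captured from llama.cpp and from
// llamafile (e.g. dumped as raw floats), report the maximum absolute and
// relative difference. A per-op report like this is how a mismatch can be
// pinned on GGML_OP_MUL_MAT rather than on later layers.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct diff_stats {
    float max_abs = 0.0f;   // largest |a - b|
    float max_rel = 0.0f;   // largest |a - b| / max(|a|, |b|, eps)
};

diff_stats compare_op_output(const std::vector<float> &a, const std::vector<float> &b) {
    diff_stats s;
    const float eps = 1e-8f;
    for (size_t i = 0; i < a.size() && i < b.size(); i++) {
        const float abs_diff = std::fabs(a[i] - b[i]);
        const float rel_diff = abs_diff / std::max({std::fabs(a[i]), std::fabs(b[i]), eps});
        s.max_abs = std::max(s.max_abs, abs_diff);
        s.max_rel = std::max(s.max_rel, rel_diff);
    }
    return s;
}
```

Because every layer's MUL_MAT output feeds the next layer, a small per-op difference compounds with depth, which is consistent with the 24-layer yuan2.0-2b drifting further than the 4-layer chinese-alpaca-2-1.3b.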