ggerganov / llama.cpp

LLM inference in C/C++

Question: Why do GPU and CPU embedding outputs differ for the same input? Is this normal? #7608

Open jygmysoul opened 1 month ago

jygmysoul commented 1 month ago

Background Description

I am using the embedding example. The execution parameters are as follows:

embedding.exe -ngl 200000 -m I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf --log-disable -p "Hello World!"
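
For context, my understanding is that this command maps onto roughly the following C API calls from llama.h (a sketch only; function and field names have changed between versions, so treat every identifier here as approximate, not canonical):

```cpp
// Rough sketch of an embedding run through the llama.cpp C API.
// NOTE: names/signatures below may differ between llama.cpp versions --
// this is illustrative, not a verified minimal program.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0; // 0 = pure CPU; a huge value (e.g. 200000) offloads every layer

    llama_model * model = llama_load_model_from_file("ggml-model-f32_q4_1.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings = true; // request embedding output rather than just logits

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // tokenize the prompt
    const std::string prompt = "Hello World!";
    std::vector<llama_token> tokens(prompt.size() + 8);
    const int n_tok = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                                     tokens.data(), (int) tokens.size(),
                                     /*add_special=*/true, /*parse_special=*/false);
    tokens.resize(n_tok);

    // evaluate the prompt, then read back the embedding vector
    llama_decode(ctx, llama_batch_get_one(tokens.data(), n_tok, 0, 0));
    const float * emb = llama_get_embeddings(ctx);

    for (int i = 0; i < 3 && i < llama_n_embd(model); i++) {
        printf("%.8e\n", emb[i]);
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The only difference between my CPU and GPU runs is the number of offloaded layers (the -ngl flag); the model file and prompt are identical.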

The first three embedding values output when the embedding runs on the CPU:
-4.67528416e-08 -1.07059577e-06 1.76811977e-06

The first three embedding values output when it runs on the GPU (-ngl 200000):
5.86615059e-08 -1.02221782e-06 1.78800110e-06
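
To quantify the gap, here is a small self-contained comparison of the values above (a hypothetical helper, not part of llama.cpp; the tolerances are illustrative assumptions):

```cpp
// Compare the reported CPU and GPU values with an absolute/relative
// tolerance instead of expecting bitwise equality.
#include <cmath>
#include <cstdio>

static bool nearly_equal(float a, float b,
                         float abs_tol = 1e-6f, float rel_tol = 1e-3f) {
    const float diff = std::fabs(a - b);
    return diff <= abs_tol || diff <= rel_tol * std::fmax(std::fabs(a), std::fabs(b));
}

int main() {
    // first three embedding values reported above
    const float cpu[3] = { -4.67528416e-08f, -1.07059577e-06f, 1.76811977e-06f };
    const float gpu[3] = {  5.86615059e-08f, -1.02221782e-06f, 1.78800110e-06f };

    for (int i = 0; i < 3; i++) {
        printf("i=%d |diff|=%.3e nearly_equal=%d\n",
               i, std::fabs(cpu[i] - gpu[i]), (int) nearly_equal(cpu[i], gpu[i]));
    }
    return 0; // all three differ by ~1e-7 or less and pass the tolerance check
}
```

For whole vectors, cosine similarity between the CPU and GPU embeddings is a more meaningful check than element-wise comparison, since near-zero components like these carry almost no signal on their own.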

Why do the outputs differ for the same "Hello World!" input? Does llama.cpp currently support embedding correctly on both the GPU and the CPU?

Also, does llama.cpp provide documentation for the underlying API functions, or notes on usage precautions? Is there an interface documentation website besides what is on GitHub? Thank you.

Possible Answer

I expected that, for the same input content, the GPU and CPU would output identical embedding values.
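
A likely explanation, though: bitwise equality across backends is not something llama.cpp (or floating-point code in general) can promise. The CPU and GPU backends implement the same math with different kernels, different reduction orders, and different fusion of operations, and floating-point addition is not associative, so results legitimately differ by rounding noise; the differences above are on the order of 1e-7, which is consistent with f32 round-off accumulated through a 13B-parameter model. A minimal standalone demonstration of the underlying effect (plain C++, not llama.cpp code):

```cpp
// Floating-point addition is not associative: summing the same numbers in a
// different order (as a GPU reduction typically does) changes the result.
#include <cstdio>

int main() {
    const float vals[4] = { 1.0e8f, 1.0f, 1.0f, -1.0e8f };

    // left-to-right, as a simple CPU loop might do:
    // (1e8f + 1.0f) rounds back to 1e8f, so both 1.0f contributions are lost
    float left_to_right = 0.0f;
    for (float v : vals) {
        left_to_right += v;
    }

    // a different order, as a parallel/tree reduction might produce:
    // the large values cancel first, so the small ones survive
    const float reordered = (vals[0] + vals[3]) + (vals[1] + vals[2]);

    printf("left-to-right: %.1f\n", left_to_right); // prints 0.0
    printf("reordered:     %.1f\n", reordered);     // prints 2.0
    return 0;
}
```

Scaled down to realistic magnitudes and repeated across millions of multiply-adds per token, this reordering effect easily produces per-element differences of ~1e-7, so outputs like the ones above are expected and normal.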

jygmysoul commented 3 weeks ago

UP