Sixzero opened this issue 6 months ago
Could something like the following be happening here: https://github.com/google/gemma.cpp/issues/23? Basically, the way quantization is implemented seems to result in lower performance on some architectures.
Exactly. Quantizing the hidden state to q8_0 does not seem to be a good idea (see https://github.com/ggerganov/llama.cpp/issues/4755; it is unfortunate that the bot closed it). We should rewrite our quantized vecdot routines to do the calculations in fp16 or fp32. The challenge is to do this without degrading the speed of the vecdots too much.
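To illustrate the idea, here is a minimal sketch of a vecdot that dequantizes the weights on the fly and accumulates in fp32, instead of first quantizing the activation to q8_0. The block layout (32 values plus one scale) loosely mirrors ggml-style formats, but all names and types here are illustrative assumptions, not Llama2.jl's actual API:

```julia
# Toy Q4 block: 32 quantized values sharing one Float32 scale.
struct Q4Block
    scale::Float32
    quants::NTuple{32,Int8}  # values in -8:7
end

# Quantize a 32-element chunk to the toy Q4 format.
function quantize_q4(x::AbstractVector{Float32})
    @assert length(x) == 32
    amax = maximum(abs, x)
    scale = amax / 7f0
    inv = scale == 0f0 ? 0f0 : 1f0 / scale
    q = ntuple(i -> Int8(clamp(round(Int, x[i] * inv), -8, 7)), 32)
    return Q4Block(scale, q)
end

# fp32-accumulating vecdot: the activation `x` stays in full precision,
# avoiding the extra rounding error a q8_0 round-trip would introduce.
function vecdot_fp32(blocks::Vector{Q4Block}, x::Vector{Float32})
    acc = 0f0
    for (b, blk) in enumerate(blocks)
        off = (b - 1) * 32
        s = blk.scale
        @inbounds for i in 1:32
            acc += s * blk.quants[i] * x[off + i]
        end
    end
    return acc
end
```

The speed concern is that the real routines quantize the activation precisely so the inner loop can run on integer SIMD instructions; keeping `x` in fp32 trades that for accuracy.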
FWIW we're (gemma.cpp) actually using fp32.
With https://github.com/cafaxo/Llama2.jl/commit/42001c59064aabf4805ab7454c5a1d117d6c6d3c, the zero-temperature behavior now better matches the Metal backend of llama.cpp:
Llama2.jl (at 42001c59064aabf4805ab7454c5a1d117d6c6d3c):

```
julia> sample(model, "The Julia programming language."; temperature=0.0f0)
The Julia programming language. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of packages and libraries.

Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of packages and libraries.

## Installation

### Installing Julia

#### Installing Julia from the Julia website
```
llama.cpp (at https://github.com/ggerganov/llama.cpp/commit/637e9a86c220718d008b54842dfd294aa96d3b7a):

```
The Julia programming language. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.

Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.

Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.

Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.
```
This is using the `llama-2-7b.Q4_K_S.gguf` model.
We need to find a way to determine what causes the remaining differences between the two implementations.
The goal is to get the same, or nearly the same, results at temp=0. We ran some tests with the new `.gguf` files, since the format has seen such wide adoption.

Llama2.jl test:

llama.cpp `.gguf` test:

```shell
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."
```

Current Llama2.jl results:

Current llama.cpp results:
We need an efficient way to pinpoint what causes the differences between the two.
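One cheap first step could be to locate the first position where the two greedy (temp=0) generations diverge, and then compare logits or per-layer hidden states at that position. A minimal sketch, assuming the two runs are available as token-ID vectors (the helper below is hypothetical, not part of either project's API):

```julia
# Return the 1-based index of the first token where two greedy generations
# differ, or `nothing` if one is a prefix of the other of equal length.
function first_divergence(tokens_a::Vector{Int}, tokens_b::Vector{Int})
    n = min(length(tokens_a), length(tokens_b))
    for i in 1:n
        tokens_a[i] != tokens_b[i] && return i
    end
    # Equal up to the shorter length; differ only if the lengths differ.
    return length(tokens_a) == length(tokens_b) ? nothing : n + 1
end
```

Once the first divergent token is known, re-running both implementations on the shared prefix and diffing the logits (or each layer's output) at that step should narrow down which operation drifts first.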