Sixzero opened this issue 6 months ago
Could something like the following be happening here: https://github.com/google/gemma.cpp/issues/23? Basically, the way quantization is implemented seems to result in lower performance on some architectures.
Exactly. Quantizing the hidden state to q8_0 does not seem to be a good idea (see https://github.com/ggerganov/llama.cpp/issues/4755; it is unfortunate that the bot closed it). We should rewrite our quantized vecdot routines to do the calculations in fp16 or fp32. The challenge is to do this without degrading the speed of the vecdots too much.
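To illustrate the idea, here is a minimal sketch of a vecdot that dequantizes the weights on the fly and accumulates in fp32, instead of first quantizing the activation to q8_0. The block layout (32 values plus one scale) loosely mirrors ggml-style formats, but all names and types here are illustrative assumptions, not Llama2.jl's actual API:

```julia
# Toy Q4 block: 32 quantized values sharing one Float32 scale.
struct Q4Block
    scale::Float32
    quants::NTuple{32,Int8}  # values in -8:7
end

# Quantize a 32-element chunk to the toy Q4 format.
function quantize_q4(x::AbstractVector{Float32})
    @assert length(x) == 32
    amax = maximum(abs, x)
    scale = amax / 7f0
    inv = scale == 0f0 ? 0f0 : 1f0 / scale
    q = ntuple(i -> Int8(clamp(round(Int, x[i] * inv), -8, 7)), 32)
    return Q4Block(scale, q)
end

# fp32-accumulating vecdot: the activation `x` stays in full precision,
# avoiding the extra rounding error a q8_0 round-trip would introduce.
function vecdot_fp32(blocks::Vector{Q4Block}, x::Vector{Float32})
    acc = 0f0
    for (b, blk) in enumerate(blocks)
        off = (b - 1) * 32
        s = blk.scale
        @inbounds for i in 1:32
            acc += s * blk.quants[i] * x[off + i]
        end
    end
    return acc
end
```

The speed concern is that the real routines quantize the activation precisely so the inner loop can run on integer SIMD instructions; keeping `x` in fp32 trades that for accuracy.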
FWIW we're (gemma.cpp) actually using fp32.
With https://github.com/cafaxo/Llama2.jl/commit/42001c59064aabf4805ab7454c5a1d117d6c6d3c, the zero-temperature behavior now better matches the Metal backend of llama.cpp:
Llama2.jl (at 42001c59064aabf4805ab7454c5a1d117d6c6d3c):

```
julia> sample(model, "The Julia programming language."; temperature=0.0f0)
The Julia programming language. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of packages and libraries.

Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of packages and libraries.

## Installation

### Installing Julia

#### Installing Julia from the Julia website
```
llama.cpp (at https://github.com/ggerganov/llama.cpp/commit/637e9a86c220718d008b54842dfd294aa96d3b7a):

```
The Julia programming language. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.

Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.

Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.

Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.
```
This is using the `llama-2-7b.Q4_K_S.gguf` model.
We need to find a way to determine what causes the remaining differences between the two implementations.
The goal is to get the same, or nearly the same, results at temp=0. We ran some tests with the new `.gguf` files, since the format has seen such wide adoption.

Llama2.jl test:

llama.cpp `.gguf` test:

```shell
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."
```

Current Llama2.jl results:

Current llama.cpp results:
We need an efficient way to pinpoint what causes the differences between the two.
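One cheap first step could be to locate the first position where the two greedy (temp=0) generations diverge, and then compare logits or per-layer hidden states at that position. A minimal sketch, assuming the two runs are available as token-ID vectors (the helper below is hypothetical, not part of either project's API):

```julia
# Return the 1-based index of the first token where two greedy generations
# differ, or `nothing` if one is a prefix of the other of equal length.
function first_divergence(tokens_a::Vector{Int}, tokens_b::Vector{Int})
    n = min(length(tokens_a), length(tokens_b))
    for i in 1:n
        tokens_a[i] != tokens_b[i] && return i
    end
    # Equal up to the shorter length; differ only if the lengths differ.
    return length(tokens_a) == length(tokens_b) ? nothing : n + 1
end
```

Once the first divergent token is known, re-running both implementations on the shared prefix and diffing the logits (or each layer's output) at that step should narrow down which operation drifts first.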