ggerganov / llama.cpp

LLM inference in C/C++

Bug: Erroneous Output in llama-cli #9848

Open · ericcurtin opened this issue 3 weeks ago

ericcurtin commented 3 weeks ago

What happened?

When using llama.cpp models (e.g., granite-code and llama3) with Nvidia GPU acceleration (nvidia/cuda:12.6.1-devel-ubi9 image, RTX 3080 with 10 GB VRAM), the models occasionally return nonsensical or garbled output after a few valid responses. This occurs even when the input prompts are simple, like basic arithmetic or listing prime numbers. Running either model with -ngl 50 leads to the issue, suggesting it could be related to VRAM usage or GPU settings. The problem does not occur with Ollama's GPU-accelerated llama3 using the exact same .gguf files.

The llama-cli command used is:

llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv

ramalama project issue:

https://github.com/containers/ramalama/issues/247

I don't think this kind of issue is Nvidia-specific; in general, Ollama seems to produce higher-quality responses than llama-cli.

Name and Version

$ llama-cli --version
version: 3821 (70392f1f) built with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for x86_64-redhat-linux

What operating system are you seeing the problem on?

Linux

Relevant log output

llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv
> what is 2+2 answer the question only
444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv
> what is 2+2 answer the question and no more
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
ggerganov commented 3 weeks ago

Does it work after applying this patch:

diff --git a/src/llama.cpp b/src/llama.cpp
index da7afb1e..fde09bec 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9517,20 +9517,14 @@ static struct ggml_tensor * llm_build_kqv(
         cur = ggml_flash_attn_ext(ctx, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
                                   hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);

-        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_GEMMA2) {
-            ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
-        }
+        ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);

         cur = ggml_reshape_2d(ctx, cur, n_embd_head_v*n_head, n_tokens);
     } else {
         struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
         cb(kq, "kq", il);

-        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_QWEN2 || model.arch == LLM_ARCH_NEMOTRON || model.arch == LLM_ARCH_CHATGLM) {
-            // for this arch, we need to perform the KQ multiplication with F32 precision, otherwise we get NaNs
-            // ref: https://github.com/ggerganov/llama.cpp/pull/4490#issuecomment-1859055847
-            ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
-        }
+        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);

         if (model.arch == LLM_ARCH_GROK) {
             // need to do the following:
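For anyone who wants to try it, a minimal sketch of applying the patch and rebuilding, assuming the diff above is saved as force-f32.patch inside the llama.cpp checkout (the file name and the CUDA build flag are just examples):

git apply force-f32.patch
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)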
slaren commented 3 weeks ago

Looks like a duplicate of #9838. The issue seems to be related to using a container.

ericcurtin commented 3 weeks ago

@bmahabirbu could you test this out?

ericcurtin commented 3 weeks ago

Yes, it is a duplicate, sorry; I wasn't aware @bmahabirbu had logged it.

bmahabirbu commented 3 weeks ago

@ericcurtin my apologies for not referencing you in the first issue. I'll try this patch @ggerganov thank you!

bmahabirbu commented 3 weeks ago

Unfortunately, the patch did not work. I have a feeling it has something to do with WSL2 not giving the container the resources it needs.

slaren commented 3 weeks ago

If it works correctly without the container, then the container is the cause. You can try using the official dockerfile instead and see if it works with that.
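For reference, a minimal sketch of building the official CUDA image from the repository root (the image tag is arbitrary):

docker build -t llama-cpp-full-cuda -f .devops/full-cuda.Dockerfile .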

bmahabirbu commented 3 weeks ago

It's also worth noting that the official Ollama container works.

slaren commented 3 weeks ago

As @JohannesGaessler already pointed out in the other issue, the problem may be the use of CUDA_ARCH, which is not the correct way to set the CUDA architectures. My suggestion would be to switch to using cmake to build llama.cpp, as in the official dockerfile, since it has much better defaults for the CUDA archs.
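For reference, with cmake the CUDA architectures are selected via the standard CMAKE_CUDA_ARCHITECTURES variable rather than the CUDA_ARCH/CUDA_DOCKER_ARCH variables used by the make build; a minimal sketch (the value 80 is just an example for an Ampere card, and can be omitted to fall back on the project defaults):

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
cmake --build build --config Release -j$(nproc)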

bmahabirbu commented 3 weeks ago

Makes sense! Originally I used make because docker build couldn't find libcuda when using cmake.

ericcurtin commented 3 weeks ago

@ggerganov @slaren @JohannesGaessler

A somewhat related question: llama-cli and Ollama are two tools that use llama.cpp as a library. Using the exact same .gguf files with both, Ollama seems to produce higher-quality responses in general, even when problems like the above aren't encountered. We tend to call llama-cli like this in the RamaLama project:

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --in-prefix '' --in-suffix '' --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv

What is it about the way Ollama uses llama.cpp that seems to generate better responses?

ggerganov commented 3 weeks ago

Likely different sampling parameters. These can have a high impact on the quality of the generated text. Try to match the settings between the two tools and see if this resolves the discrepancy.
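As a point of comparison, Ollama's documented Modelfile defaults are temperature 0.8, top_k 40, top_p 0.9, repeat_penalty 1.1, and repeat_last_n 64, so one way to start is to pin the same values on the llama-cli side; a sketch, assuming those defaults apply to the model in question (its Modelfile may override them):

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest \
    --in-prefix '' --in-suffix '' --no-display-prompt \
    -p "You are a helpful assistant" -c 2048 -cnv \
    --temp 0.8 --top-k 40 --top-p 0.9 --repeat-penalty 1.1 --repeat-last-n 64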

bmahabirbu commented 3 weeks ago

> As @JohannesGaessler already pointed out in the other issue, the problem may be the use of CUDA_ARCH, which is not the correct way to set the CUDA architectures. My suggestion would be to switch to using cmake to build llama.cpp, as in the official dockerfile, since it has much better defaults for the CUDA archs.

Thanks a bunch @slaren! This is what fixed the issue! For reference, this is what the relevant part of my new Containerfile looks like:

# CUDA architecture values (for CMAKE_CUDA_ARCHITECTURES):
# Turing GPUs (e.g., RTX 20 Series, GTX 16 Series): Use 75
# Ampere GPUs (e.g., RTX 30 Series, A100): Use 80
# Hopper GPUs (e.g., H100): Use 90
# Volta GPUs (e.g., V100): Use 70
# Pascal GPUs (e.g., GTX 10 Series): Use 61
# Kepler GPUs (e.g., GTX 600 and 700 Series): Use 35

# Followed https://github.com/ggerganov/llama.cpp/blob/master/.devops/full-cuda.Dockerfile
# for reference to build llama.cpp with cuda using cmake

RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    cmake --build build --config Release -j$(nproc) && \
    cd build/bin && \
    mv llama-cli /usr/bin/llama-cli && \
    mv llama-server /usr/bin/llama-server && \
    cd / && \
    rm -rf llama.cpp

I also targeted the CUDA arch for my GPU instead of the default.

ericcurtin commented 3 weeks ago

@bmahabirbu if you could test that this also works with podman and open a PR against ramalama, that would be great!