ericcurtin opened 3 weeks ago
Does it work after applying this patch:
diff --git a/src/llama.cpp b/src/llama.cpp
index da7afb1e..fde09bec 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9517,20 +9517,14 @@ static struct ggml_tensor * llm_build_kqv(
cur = ggml_flash_attn_ext(ctx, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
- if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_GEMMA2) {
- ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
- }
+ ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
cur = ggml_reshape_2d(ctx, cur, n_embd_head_v*n_head, n_tokens);
} else {
struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
cb(kq, "kq", il);
- if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_QWEN2 || model.arch == LLM_ARCH_NEMOTRON || model.arch == LLM_ARCH_CHATGLM) {
- // for this arch, we need to perform the KQ multiplication with F32 precision, otherwise we get NaNs
- // ref: https://github.com/ggerganov/llama.cpp/pull/4490#issuecomment-1859055847
- ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
- }
+ ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
if (model.arch == LLM_ARCH_GROK) {
// need to do the following:
Looks like a duplicate of #9838. The issue seems to be related to using a container.
@bmahabirbu could you test this out?
Yes, it is a duplicate, sorry; I wasn't aware @bmahabirbu had logged it.
@ericcurtin my apologies for not referencing you in the first issue. I'll try this patch, @ggerganov, thank you!
Unfortunately, the patch did not work. I have a feeling it's something to do with WSL2 not giving the necessary resources to the container.
If it works correctly without the container, then the container is the cause. You can try using the official dockerfile instead and see if it works with that.
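A rough sketch of that route, assuming the llama.cpp repository is checked out locally and the NVIDIA container toolkit is installed on the host (the image tag here is arbitrary):
# build an image from the CUDA Dockerfile shipped in the llama.cpp repo
docker build -t llama-cpp-full-cuda -f .devops/full-cuda.Dockerfile .
# GPU access is granted at run time; arguments to the image's entrypoint are omitted here
docker run --rm --gpus all llama-cpp-full-cuda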
It's also worth noting that using the official Ollama container works.
As @JohannesGaessler already pointed out in the other issue, the problem may be the use of CUDA_ARCH, which is not the correct way to set the CUDA architectures. My suggestion would be to switch to using cmake to build llama.cpp, as in the official Dockerfile, since it has much better defaults for the CUDA archs.
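With cmake, the architectures come from the standard CMAKE_CUDA_ARCHITECTURES variable, and llama.cpp's build picks a sensible default set when it is not given (the exact defaults depend on the version). A minimal sketch; the value 80 is just an Ampere example, not a recommendation:
# either set the architectures explicitly or drop the flag to use the project defaults
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
cmake --build build --config Release -j$(nproc)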
Makes sense! Originally I used make because docker build couldn't find libcuda when using cmake.
@ggerganov @slaren @JohannesGaessler
A somewhat related question, though. llama-cli and ollama are two tools that use llama.cpp as a library. Using the exact same .gguf files with both, Ollama seems to produce higher quality responses in general, even when problems like the above aren't encountered. We tend to call llama-cli like this in the RamaLama project:
llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --in-prefix '' --in-suffix '' --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv
What is it about the way Ollama uses llama.cpp that seems to generate better responses?
Likely different sampling parameters. These can have a high impact on the quality of the generated text. Try to match the settings between the two tools and see if this resolves the discrepancy.
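For example, if the other runtime were running with temperature 0.8, top-k 40, top-p 0.9, and a repeat penalty of 1.1 (these values are purely illustrative, not a statement of Ollama's actual defaults), the matching llama-cli invocation would be:
llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest -c 2048 -cnv \
  --temp 0.8 --top-k 40 --top-p 0.9 --repeat-penalty 1.1 \
  --in-prefix '' --in-suffix '' --no-display-prompt -p "You are a helpful assistant"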
Thanks a bunch @slaren! Switching to a cmake build, as in the official Dockerfile, is what fixed the issue. For reference, this is what the relevant part of my new Containerfile looks like:
# CMAKE_CUDA_ARCHITECTURES =
# Turing GPUs (e.g., RTX 20 Series, GTX 16 Series): Use 75
# Ampere GPUs (e.g., RTX 30 Series, A100): Use 80
# Hopper GPUs (e.g., H100): Use 90
# Volta GPUs (e.g., V100): Use 70
# Pascal GPUs (e.g., GTX 10 Series): Use 61
# Kepler GPUs (e.g., GTX 600 and 700 Series): Use 35
# Followed https://github.com/ggerganov/llama.cpp/blob/master/.devops/full-cuda.Dockerfile
# for reference to build llama.cpp with cuda using cmake
RUN git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
git reset --hard ${LLAMA_CPP_SHA} && \
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release -j$(nproc) && \
cd build/bin && \
mv llama-cli /usr/bin/llama-cli && \
mv llama-server /usr/bin/llama-server && \
cd / && \
rm -rf llama.cpp
I also targeted the CUDA architecture for my GPU instead of the default.
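If you are unsure which value your card needs, recent NVIDIA drivers can report the compute capability directly (the query field below may not exist on older driver versions):
nvidia-smi --query-gpu=name,compute_cap --format=csv
# an RTX 3080 reports 8.6, i.e. architecture 86; binaries built for 80 (same major version) also run on it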
@bmahabirbu if you could test that this also works with podman and open a PR on RamaLama, that would be great!
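Something along these lines should work for that test, assuming the NVIDIA container toolkit's CDI spec has been generated on the host (image name and model path are illustrative):
podman build -t ramalama-cuda -f Containerfile .
podman run --rm --device nvidia.com/gpu=all \
  -v /var/lib/ramalama/models:/var/lib/ramalama/models:ro \
  ramalama-cuda llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest \
  -ngl 50 -c 2048 -cnv -p "You are a helpful assistant"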
What happened?
When using llama.cpp models (e.g., granite-code and llama3) with Nvidia GPU acceleration (nvidia/cuda:12.6.1-devel-ubi9 and RTX 3080 10GB VRAM), the models occasionally return nonsensical or garbled output after a few valid responses. This occurs even when the input prompts are simple, like basic arithmetic or listing prime numbers. Running the model using -ngl 50 in both configurations leads to the issue, suggesting it could be related to VRAM usage or GPU settings. This problem does not occur with Ollama’s GPU-accelerated version of llama3 using the exact same .gguf files.
The llama-cli command used is:
llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv
ramalama project issue:
https://github.com/containers/ramalama/issues/247
I don't think this kind of issue is Nvidia specific; in general, Ollama seems to produce higher quality responses than llama-cli.
Name and Version
$ llama-cli --version
version: 3821 (70392f1f)
built with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for x86_64-redhat-linux
What operating system are you seeing the problem on?
Linux
Relevant log output