bmahabirbu opened this issue 8 hours ago
ggml_debug: Kcur-2 = (f32) ROPE(Kcur-2 (reshaped){128, 8, 11, 1}, CUDA0#inp_pos#0{11, 1, 1, 1}}) = {128, 8, 11, 1}
[
[
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
https://github.com/ggerganov/llama.cpp/issues/7048 I checked out llama-eval-callback with a prompt that causes the error and the output is the same as in this issue above. Although that issue wasn't resolved, the author had a hardware error. I'm not sure how to test for a VRAM hardware error, although I'm fairly sure I don't have one. Could this be an issue with WSL2 and VRAM allocation?
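For reference, a rough way to at least confirm the GPU and its VRAM are visible from inside WSL2 (this is only a sanity check, not a real memory test) would be something like:

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
nvidia-smi -q -d ECC   # ECC error counters, only meaningful if the GPU supports ECC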
Could this be an issue with WSL2 and VRAM allocation?
I don't think that's very likely; I use CUDA almost exclusively with WSL2.
https://github.com/ggerganov/llama.cpp/issues/6957#issuecomment-2251308757
Came across this. A little more insight would be great, as I'm not sure what this means. Is it a problem with how I'm using the Llama 3 model?
I don't know, you can try adding this arch to the list of exceptions to see if that fixes it.
If this problem only occurs for specific prompts (and not for all prompts of the same length), it could be due to numerical issues. Does the ggml_debug print above mean that the output of the CUDA implementation of RoPE is NaN even though the inputs are not NaN?
Not sure. The ggml_debug output starts out normal, then NaNs start to flood after
ggml_debug: k-7 = (f16) VIEW(cache_k_l7{524288, 1, 1, 1}, }) = {64, 32, 4, 1}
[
[
[ -0.0157, -0.0056, -0.0117, ..., 0.3660, 1.1211, -1.8447],
[ 0.2224, -0.1504, 0.7095, ..., -0.4265, -0.0087, 0.7388],
[ 0.3860, 0.6807, -1.2812, ..., -2.0469, -1.5107, 0.7300],
From what I've found, it's prompts of the same length.
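One way to narrow down where the NaNs first appear (a sketch; the exact flags and model path are assumptions on my part) is to dump the eval-callback output to a file and search for the first tensor that contains a nan entry:

llama-eval-callback -m model.gguf -ngl 99 -p "prompt that triggers the garbage" 2>&1 | tee eval.log
grep -n -m1 ' nan,' eval.log   # line number of the first tensor containing nan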
An important update from me: in WSL2 I'm using Podman with a container I made
FROM nvidia/cuda:12.6.1-devel-ubi9
# renovate: datasource=github-releases depName=huggingface/huggingface_hub extractVersion=^v(?<version>.*)
ARG HUGGINGFACE_HUB_VERSION=0.25.0
# renovate: datasource=github-releases depName=containers/omlmd extractVersion=^v(?<version>.*)
ARG OMLMD_VERSION=0.1.4
# renovate: datasource=git-refs depName=ggerganov/llama.cpp packageName=https://github.com/ggerganov/llama.cpp gitRef=master versioning=loose type=digest
ARG LLAMA_CPP_SHA=dca1d4b58a7f1acf1bd253be84e50d6367f492fd
# renovate: datasource=git-refs depName=ggerganov/whisper.cpp packageName=https://github.com/ggerganov/whisper.cpp gitRef=master versioning=loose type=digest
ARG WHISPER_CPP_SHA=5caa19240d55bfd6ee316d50fbad32c6e9c39528
# vulkan-headers vulkan-loader-devel vulkan-tools glslc glslang python3-pip mesa-libOpenCL-$MESA_VER.aarch64
RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
crb enable && \
dnf install -y epel-release && \
dnf --enablerepo=ubi-9-appstream-rpms install -y git procps-ng vim \
dnf-plugins-core python3-dnf-plugin-versionlock cmake gcc-c++ \
python3-pip && \
dnf clean all && \
rm -rf /var/cache/*dnf*
RUN /usr/bin/python3 --version
RUN pip install "huggingface_hub[cli]==${HUGGINGFACE_HUB_VERSION}"
RUN pip install "omlmd==${OMLMD_VERSION}"
# Build wouldn't complete: couldn't find libcuda, so I made a symlink
# But this didn't work
# RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
# RUN LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs/
# Disable caching for debugging build
# ENV GGML_CCACHE=0
# Build wouldn't complete with cmake even with the NVIDIA Container Toolkit installed
RUN git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
git reset --hard ${LLAMA_CPP_SHA} && \
make -j $(nproc) GGML_CUDA=1 CUDA_ARCH=ALL && \
mv llama-cli /usr/bin/llama-cli && \
mv llama-server /usr/bin/llama-server && \
mv llama-eval-callback /usr/bin/llama-eval-callback && \
cd / && \
rm -rf llama.cpp
That builds llama.cpp with CUDA from a maintained NVIDIA container.
However, when I built llama.cpp with CUDA on WSL2 without using a container, it ran perfectly! Something is wrong when trying to do this from within a container. I would greatly appreciate additional setup advice from anyone who has llama.cpp with CUDA running in a container. I have a feeling the container probably does not have access to the full VRAM, or something along those lines.
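One way to check what the container actually sees (a sketch, assuming Podman with the NVIDIA Container Toolkit's CDI support; the image tag and paths are assumptions) would be:

# on the WSL2 host, (re)generate the CDI spec for the GPU
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# then run nvidia-smi inside a throwaway container with the GPU attached
podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.6.1-base-ubi9 nvidia-smi

If nvidia-smi inside the container reports the same VRAM total as on the host, the allocation theory is probably not it.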
The Makefile has no argument CUDA_ARCH, so that argument is being ignored. Does it work with CUDA_DOCKER_ARCH=compute_86?
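Concretely, that would mean changing the build step in the Containerfile to something like the following (compute_86 assumes an Ampere-class card; adjust to the actual GPU):

RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    make -j $(nproc) GGML_CUDA=1 CUDA_DOCKER_ARCH=compute_86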
What happened?
When I run llama-cli with CUDA support, I get back garbage depending on how long the prompt is. What could be causing this issue? I'm running this in a container on WSL2.
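For reference, a representative invocation (the model path, context size, and offload count here are assumptions, not the exact command used):

llama-cli -m /models/llama3-8b-instruct.gguf -ngl 99 -c 4096 -p "a sufficiently long prompt that triggers the garbage output"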
Name and Version
version: 3899 (dca1d4b5) built with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for x86_64-redhat-linux
What operating system are you seeing the problem on?
WSL2 linux
Relevant log output