rsoika opened this issue 7 months ago
Hey @rsoika, this might be related to #1319, in which case just update to the latest version of llama-cpp-python.
As I just noted on #1319, I'm still seeing errors which I think are related to that bug even in v0.2.59.
Thanks a lot for your feedback! I will look into this.
It looks like logits_all=True fixes the problem...
@rsoika thanks, I'll keep this open, just trying to repro now.
Question about the log you linked to:
> sampling_order = torch.multinomial(probs_torch, len(probs_torch)).cpu().numpy()
E RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
guidance\models\_model.py:258: RuntimeError
So is the segfault issue resolved, but now it's outputting invalid values in the logprobs?
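For reference, the quoted RuntimeError can be reproduced standalone whenever the probability tensor handed to torch.multinomial contains NaN or inf, which is what invalid logits from the model would produce. A minimal illustration (not taken from guidance itself):

import torch

# A probability vector with a NaN entry, standing in for what invalid
# model logits would produce after softmax.
probs = torch.tensor([0.5, float("nan"), 0.25])
try:
    torch.multinomial(probs, num_samples=probs.numel())
except RuntimeError as e:
    print(e)  # probability tensor contains either `inf`, `nan` or element < 0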
@abetlen I did not link a log file.
At the moment I just added the logits_all option when I create my model instance:
from llama_cpp import Llama

model = Llama(
    model_path=model_path,
    n_gpu_layers=30,
    n_ctx=3584,
    n_batch=521,
    verbose=True,
    logits_all=True,  # workaround for the crash/NaN issue discussed above
    echo=False,
)
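With that instance, completion calls work again. A minimal usage sketch (the prompt is just an example, not my actual app code):

# Run a completion against the instance configured above
output = model(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=64,
    stop=["Q:", "\n"],
)
print(output["choices"][0]["text"])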
And this seems to solve all problems. I run my app in a Docker image with the following build script:
# See: https://github.com/abetlen/llama-cpp-python/blob/main/docker/cuda_simple/Dockerfile
ARG CUDA_IMAGE="12.1.1-devel-ubuntu22.04"
FROM nvidia/cuda:${CUDA_IMAGE}
ENV HOST 0.0.0.0
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y git build-essential \
    python3 python3-pip gcc wget \
    ocl-icd-opencl-dev opencl-headers clinfo \
    libclblast-dev libopenblas-dev \
    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
# setting build related env vars
ENV CUDA_DOCKER_ARCH=all
ENV LLAMA_CUBLAS=1
# Install dependencies
RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context
# Install llama-cpp-python (build with cuda)
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
RUN pip install fastapi-xml
COPY ./app /app
WORKDIR /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
I hope this helps.
Thanks that also helps! And I tagged you by mistake, sorry about that!
I meant the log @riedgar-ms posted in the other issue.
OK, finally I also cleaned up my Dockerfile, and I do indeed only build the llama-cpp-python code for my GPU. No additional libs are needed; everything is included in the nvidia/cuda image.
So I think this is how a minimal Dockerfile should look:
ARG CUDA_IMAGE="12.1.1-devel-ubuntu22.04"
FROM nvidia/cuda:${CUDA_IMAGE}
# Install Python3
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y build-essential python3 python3-pip gcc
# setting build related env vars
ENV CUDA_DOCKER_ARCH=all
ENV LLAMA_CUBLAS=1
# Install build dependencies
RUN python3 -m pip install --upgrade pip pytest cmake fastapi uvicorn
# Install llama-cpp-python (build with cuda)
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Install fastapi-xml and copy the app
RUN pip install fastapi-xml
COPY ./app /app
WORKDIR /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
I've added logits_all=True to the constructor:
llama_cpp.Llama(model_path=model, logits_all=True, **kwargs)
However, on Windows and macOS, I'm getting an AccessViolation/segfault. Both of those are on Python 3.12. Ubuntu is not segfaulting, but torch is subsequently throwing an error:
> sampling_order = torch.multinomial(probs_torch, len(probs_torch)).cpu().numpy()
E RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I'm currently running the Ubuntu test on Python 3.12, to see if that does the same thing.
Update: Ubuntu on Python 3.12 gives the same "probability contains inf, nan or <0" error as Ubuntu on Python 3.10
Facing similar issues with Command R+ & Miqu on a GPU offload setup. On Python 3.11 with oobabooga, I'm getting the above "probability tensor contains inf, nan or < 0" error after the initial prompt eval too, but somehow it works if I retry. To be exact: NaN error, then it works, then NaN error again, then it works, in an alternating pattern if I keep sending new messages (regenerating the current message doesn't seem to run into any issues). Once in a while, it segfaults instead.
EDIT: the logits_all workaround works, but it increases the VRAM usage for context significantly.
EDIT 2: https://github.com/oobabooga/text-generation-webui/commit/3e3a7c42501e871fb40077106a55e59d4a3651d3 is an interesting commit. Normally I would investigate further or provide more detailed logs, but I haven't the time.
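The extra memory from logits_all is plausibly just the per-position logit buffer: instead of keeping logits for only the last token, they are kept for every position in the context. A rough, unverified back-of-envelope estimate, assuming 32-bit logits and Mistral's 32000-token vocabulary (whether this lands in VRAM or system RAM likely depends on the backend):

n_ctx = 3584     # context size used earlier in this thread
n_vocab = 32000  # Mistral 7B vocabulary size
extra_bytes = n_ctx * n_vocab * 4  # float32 logits for every position
print(f"~{extra_bytes / 2**20:.1f} MiB of extra logit storage")  # ~437.5 MiB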
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I ran into a problem running llama-cpp-python with Mistral 7B on GPU/CUDA.
Only when I use small prompts, like in the following example, does my mistral-7b-instruct-v0.2.Q4_K_M.gguf model work.
Outcome:
Current Behavior
But if I try more complex prompts, the model crashes with:
Then the only solution seems to be to reduce the parameter n_gpu_layers from a value of 30 to only 10. Also, other parameters like n_ctx and n_batch can cause a crash. This all only happens when I use the GPU. Without the GPU the program runs slowly, but without any crashes.
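For illustration, this is roughly the change that avoids the crash, as a simplified sketch (not my full program; the model path is just a placeholder):

from llama_cpp import Llama

model = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=10,  # reduced from 30; with 30 the model crashes on the GPU
)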
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
My hardware is an Intel Core i7-7700 CPU + a GeForce GTX 1080. My program runs in a Docker container based on nvidia/cuda:12.1.1-devel-ubuntu22.04
$ lscpu
$ uname -a
Linux imixs-ai 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
How can I provide more useful information about the crash?