abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Mistral 7b crashes permanently with GPU #1326

Open rsoika opened 5 months ago

rsoika commented 5 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

I run into a problem running llama-cpp-python with Mistral 7b with GPU/CUDA.

Only when I use small prompts like in the following example does my mistral-7b-instruct-v0.2.Q4_K_M.gguf model work:

    from llama_cpp import Llama

    # model_path is the path to the mistral-7b-instruct-v0.2.Q4_K_M.gguf file
    llm = Llama(model_path=model_path, n_gpu_layers=30, n_ctx=3584, n_batch=521, verbose=True)
    output = llm("Q: Name and explain the planets in the solar system? A: ", max_tokens=2000, stop=["Q:", "\n"], echo=True)
    print(output)

Outcome:

app-1  | llama_new_context_with_model: n_ctx      = 3584
app-1  | llama_new_context_with_model: n_batch    = 521
app-1  | llama_new_context_with_model: n_ubatch   = 512
app-1  | llama_new_context_with_model: freq_base  = 1000000.0
app-1  | llama_new_context_with_model: freq_scale = 1
app-1  | llama_kv_cache_init:  CUDA_Host KV buffer size =    28.00 MiB
app-1  | llama_kv_cache_init:      CUDA0 KV buffer size =   420.00 MiB
app-1  | llama_new_context_with_model: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
app-1  | llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
app-1  | llama_new_context_with_model:      CUDA0 compute buffer size =   272.00 MiB
app-1  | llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
app-1  | llama_new_context_with_model: graph nodes  = 1030
app-1  | llama_new_context_with_model: graph splits = 26
app-1  | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
app-1  | --- Init Model...finished in 0.9686391353607178 sec
app-1  | --- compute prompt....
app-1  | start processing prompt:
app-1  | 
app-1  |  <s>[INST] Q: Name and explain the planets in the solar system? A:  [/INST]   
app-1  | ...
app-1  | 
app-1  | Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
app-1  | Guessed chat format: mistral-instruct
app-1  | 
app-1  | llama_print_timings:        load time =     253.05 ms
app-1  | llama_print_timings:      sample time =     316.83 ms /   817 runs   (    0.39 ms per token,  2578.63 tokens per second)
app-1  | llama_print_timings: prompt eval time =     252.99 ms /    25 tokens (   10.12 ms per token,    98.82 tokens per second)
app-1  | llama_print_timings:        eval time =   36624.03 ms /   816 runs   (   44.88 ms per token,    22.28 tokens per second)
app-1  | llama_print_timings:       total time =   39066.96 ms /   841 tokens

Current Behavior

But if I try more complex prompts the model crashes with:

Llama.generate: prefix-match hit
app-1 exited with code 139

Then the only workaround seems to be to reduce the n_gpu_layers parameter from 30 down to 10. Other parameters such as n_ctx and n_batch can also cause a crash.

This all only happens when I use the GPU. Without the GPU the program runs slowly, but without any crashes.
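For illustration, the workaround boils down to something like this sketch (the model path is a placeholder; only n_gpu_layers changes compared with the example above):

    from llama_cpp import Llama

    # Placeholder path to the gguf file used above
    model_path = "./mistral-7b-instruct-v0.2.Q4_K_M.gguf"

    # Same setup as the working example, but with n_gpu_layers reduced from 30
    # to 10, which is the only way larger prompts run stably on my GTX 1080.
    llm = Llama(model_path=model_path, n_gpu_layers=10, n_ctx=3584, n_batch=521, verbose=True)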

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

My hardware is an Intel Core i7-7700 CPU with a GeForce GTX 1080. My program runs in a Docker container based on nvidia/cuda:12.1.1-devel-ubuntu22.04.

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            9
    CPU(s) scaling MHz:  21%
    CPU max MHz:         4200.0000
    CPU min MHz:         800.0000
    BogoMIPS:            7200.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    8 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled

$ uname -a

Linux imixs-ai 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

$ python3 --version
Python 3.10.12

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

How can I provide more useful information about the crash?

abetlen commented 5 months ago

Hey @rsoika, this might be related to #1319, in which case just update to the latest version of llama-cpp-python.

riedgar-ms commented 5 months ago

As I just noted on #1319, I'm still seeing errors which I think are related to that bug even in v0.2.59.

rsoika commented 5 months ago

Thanks a lot for your feedback! I will look into this. It looks like logits_all=True fixes the problem...

abetlen commented 5 months ago

@rsoika thanks, I'll keep this open, just trying to repro now.

Question about the log you linked to:

>                   sampling_order = torch.multinomial(probs_torch, len(probs_torch)).cpu().numpy()
E                   RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

guidance\models\_model.py:258: RuntimeError

So is the segfault issue resolved, but now it's outputting invalid values in the logprobs?

rsoika commented 5 months ago

@abetlen I did not link a log file.

At the moment I just added the logits_all option when I create my model instance:

        model = Llama(
            model_path=model_path,
            n_gpu_layers=30, 
            n_ctx=3584, 
            n_batch=521, 
            verbose=True,
            logits_all=True,
            echo=False
        )

And this seems to solve all problems. I run my app in a Docker image built with the following Dockerfile:

# See: https://github.com/abetlen/llama-cpp-python/blob/main/docker/cuda_simple/Dockerfile
ARG CUDA_IMAGE="12.1.1-devel-ubuntu22.04"
FROM nvidia/cuda:${CUDA_IMAGE}

ENV HOST 0.0.0.0

RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y git build-essential \
    python3 python3-pip gcc wget \
    ocl-icd-opencl-dev opencl-headers clinfo \
    libclblast-dev libopenblas-dev \
    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

# setting build related env vars
ENV CUDA_DOCKER_ARCH=all
ENV LLAMA_CUBLAS=1

# Install dependencies
RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

# Install llama-cpp-python (build with cuda)
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

RUN pip install fastapi-xml

COPY ./app /app
WORKDIR /app

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

I hope this helps.
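For completeness, here is a hypothetical smoke test that could be run inside the container to check that the CUDA build really offloads layers (the model path and layer count below are just placeholders):

    from llama_cpp import Llama

    # Hypothetical smoke test: load the model with a single offloaded layer and
    # watch the verbose log for CUDA0 buffers; the path is only a placeholder.
    llm = Llama(model_path="/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
                n_gpu_layers=1, verbose=True)
    out = llm("Q: Name one planet in the solar system. A: ", max_tokens=16)
    print(out["choices"][0]["text"])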

abetlen commented 5 months ago

Thanks, that also helps! And I tagged you by mistake, sorry about that!

I meant the log @riedgar-ms posted in the other issue.

rsoika commented 5 months ago

OK, I finally also cleaned up my Dockerfile, and I do indeed only build the llama-cpp-python code for my GPU. No other additional libs are needed - everything is included in the nvidia/cuda image.

So I think this is what a minimal Dockerfile should look like:

ARG CUDA_IMAGE="12.1.1-devel-ubuntu22.04"
FROM nvidia/cuda:${CUDA_IMAGE}

# Install Python3
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y build-essential python3 python3-pip gcc 

# setting build related env vars
ENV CUDA_DOCKER_ARCH=all
ENV LLAMA_CUBLAS=1

# Install llama-cpp-python (build with cuda)
RUN python3 -m pip install --upgrade pip pytest cmake fastapi uvicorn
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

# Install fastAPI and copy app
RUN pip install fastapi-xml
COPY ./app /app
WORKDIR /app

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

riedgar-ms commented 5 months ago

I've added logits_all=True to the constructor:

llama_cpp.Llama(model_path=model, logits_all=True, **kwargs)

However, on Windows and macOS I'm getting an AccessViolation/segfault. Both of those are on Python 3.12. Ubuntu is not segfaulting, but torch is subsequently throwing an error:

>                   sampling_order = torch.multinomial(probs_torch, len(probs_torch)).cpu().numpy()
E                   RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

I'm currently running the Ubuntu test on Python 3.12 to see if that does the same thing.
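For reference, a minimal sketch (with synthetic data, not our actual pipeline) of how one could confirm that the probabilities already contain inf/nan before torch.multinomial is called:

    import torch

    def check_probs(probs: torch.Tensor) -> None:
        """Report inf/nan/negative entries that would make torch.multinomial raise."""
        bad = ~torch.isfinite(probs) | (probs < 0)
        if bad.any():
            idx = torch.nonzero(bad).flatten()
            print(f"{idx.numel()} invalid entries, e.g. at indices {idx[:10].tolist()}")
        else:
            # Safe to sample: this is the call that currently raises the RuntimeError.
            print(torch.multinomial(probs, len(probs)).tolist())

    check_probs(torch.tensor([0.2, float("nan"), 0.5]))  # synthetic bad input
    check_probs(torch.tensor([0.2, 0.3, 0.5]))           # well-formed input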

riedgar-ms commented 5 months ago

Update: Ubuntu on Python 3.12 gives the same "probability contains inf, nan or <0" error as Ubuntu on Python 3.10.

Interpause commented 4 months ago

I'm facing similar issues with Command R+ & Miqu on a GPU offload setup. On Python 3.11 with oobabooga, I'm also getting the above "probability tensor contains inf, nan or element < 0" error after the initial prompt eval. But somehow it works if I retry: to be exact, nan error, then it works, then nan error again, then it works, in an alternating pattern as long as I keep sending new messages (regenerating the current message doesn't seem to run into any issues). Once in a while it segfaults instead.

EDIT: The logits_all workaround works, but it increases the VRAM usage for the context significantly.

EDIT 2: https://github.com/oobabooga/text-generation-webui/commit/3e3a7c42501e871fb40077106a55e59d4a3651d3 is an interesting commit. Normally I would investigate further or provide more detailed logs, but I don't have the time.