abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 546644800, available 536870912) Segmentation fault #356

Open vmajor opened 1 year ago

vmajor commented 1 year ago

Expected Behavior

This happens (so far) only with these models:

Wizard-Vicuna-30B-Uncensored.ggmlv3.q8_0.bin
WizardLM-30B-Uncensored.ggmlv3.q8_0.bin
based-30b.ggmlv3.q8_0.bin

Larger 65B models work fine. It could be related to how these models were made; I will also reach out to @ehartford.

llama-cpp-python 0.1.59 installed with OpenBLAS

CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

I was running my usual code on the CPU, restarting it to tweak the results, when this error came up. I made no code changes other than reducing the context length, since the prompt was exceeding the 2048-token limit.

processed_output = self.llm(
    context + "\n### Instruction: \n" + instruction + "\n### Input: \n" + input_text + output,
    max_tokens=400,
    stop=None,
    temperature=0.7,
    repeat_penalty=1.1,
    top_k=80,
    top_p=0.5,
    echo=True,
)
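Since the crash appeared while tuning the context length, one defensive workaround is to clamp the prompt so that prompt tokens plus max_tokens can never exceed n_ctx. The sketch below is mine, not code from the thread; the tokenize/detokenize callables are injected placeholders (with llama-cpp-python you could pass the Llama instance's own tokenize/detokenize, which operate on bytes).

```python
def clamp_prompt(tokenize, detokenize, prompt, n_ctx=2048, max_tokens=400):
    """Trim the prompt so prompt tokens + max_tokens fit inside n_ctx.

    tokenize/detokenize are injected callables so this works with any
    tokenizer pair; the defaults mirror the settings used in this issue.
    """
    budget = n_ctx - max_tokens           # tokens left for the prompt itself
    tokens = tokenize(prompt)
    if len(tokens) <= budget:
        return prompt
    # Keep the most recent context, dropping tokens from the front.
    return detokenize(tokens[-budget:])
```

This does not fix the scratch-pool overflow itself, but it rules out oversized prompts as the trigger.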

Current Behavior

llama.cpp: loading model from /home/****/models/Wizard-Vicuna-30B-Uncensored-GGML/Wizard-Vicuna-30B-Uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 35267.28 MB (+ 6248.00 MB per state)
.
llama_init_from_file: kv self size  = 6240.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Processing all summaries...
ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 546644800, available 536870912)
Segmentation fault
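A quick check of the numbers in the error (my aside, not from the thread): the available size is exactly 512 MiB, which suggests a fixed-size scratch buffer, and the request overshoots it by about 9.3 MiB.

```python
# The two byte counts from the error message above.
needed = 546_644_800
available = 536_870_912

print(available / 2**20)         # exactly 512.0 MiB -> a fixed-size pool
print(round(needed / 2**20, 1))  # ~521.3 MiB requested
print(needed - available)        # shortfall: 9,773,888 bytes
```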

Environment and Context

wsl2 python 3.10.9

$ lscpu
AMD Ryzen 9 3900XT 12-Core Processor

$ uname -a
5.15.68.1-microsoft-standard-WSL2+ #2 SMP

$ python3 --version
Python 3.10.9

$ make --version
GNU Make 4.3

$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

For me it is 100% reproducible after several inference runs with Wizard-Vicuna-30B-Uncensored.ggmlv3.q8_0.bin.

ehartford commented 1 year ago

Standing by!

vmajor commented 1 year ago

It seems to be a llama.cpp issue. I found it mentioned regarding starcoder models too. I think you can carry on :)

vmajor commented 1 year ago

Update: the issue resolves after a reboot, so there appears to be a memory leak somewhere in the code.

The error would likely also clear if I restarted WSL2, but that is messy for me because I would need to remount my ext4 partitions, and I do not think that particular data point is as significant.
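For reference (my addition, not from the thread), a full WSL2 restart that releases all memory held by the VM can be done without rebooting Windows; the drive and partition numbers below are purely illustrative.

```shell
# From Windows (PowerShell or cmd): terminate the WSL2 VM entirely,
# which frees all memory it held.
wsl --shutdown

# After the next `wsl` launch, ext4 partitions attached via `wsl --mount`
# must be re-attached, e.g. (illustrative drive/partition numbers):
# wsl --mount \\.\PHYSICALDRIVE2 --partition 1 --type ext4
```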

eshaanagarwal commented 1 year ago

Update: the issue resolves after a reboot, so there appears to be a memory leak somewhere in the code.

The error would likely also clear if I restarted WSL2, but that is messy for me because I would need to remount my ext4 partitions, and I do not think that particular data point is as significant.

Hey, but for a commercial application we can't afford to have it behave like this, right? It is happening for me on groovy 1.3, and even then only on some operating systems such as RHEL. It's really frustrating and difficult to deal with.

gjmulder commented 1 year ago

Hey, but for a commercial application we can't afford to have it behave like this, right? It is happening for me on groovy 1.3, and even then only on some operating systems such as RHEL. It's really frustrating and difficult to deal with.

  1. Facebook did not release their llama models for commercial use.
  2. Did you pay for a license for any of the models or the llama inference code you are using?

eshaanagarwal commented 1 year ago

Hi, I am using GPT4All's GPT-J 1.3 Groovy, which has an Apache license.

vmajor commented 1 year ago

Another update: my guanaco-65B-GGML-q6_K.bin model just failed with the same error, so it is not only 30B models that are affected.

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 1143972864, available 1073741824)
Segmentation fault
ibidani commented 1 year ago

In the hope of helping isolate the bug, I tried to reproduce the issue starting from version 0.1.55. The first release where I experience the issue is 0.1.76 (0.1.75 wasn't tested; it isn't available on PyPI), and I didn't see it on 0.1.74. Could it be related to this change? https://github.com/abetlen/llama-cpp-python/compare/v0.1.74...v0.1.76#diff-9184e090a770a03ec97535fbef520d03252b635dafbed7fa99e59a5cca569fbcR200
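The narrowing described above is effectively a binary search over releases. A small helper (my sketch, not code from the thread) that finds the first bad version, given an ordered version list and a reproduction predicate:

```python
def first_bad(versions, is_bad):
    """Return the first version for which is_bad(v) is True.

    Assumes versions are ordered oldest -> newest and that badness is
    monotonic (once a release is bad, every later release is bad too).
    """
    lo, hi = 0, len(versions) - 1
    assert is_bad(versions[hi]), "newest release must reproduce the bug"
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid          # first bad release is at mid or earlier
        else:
            lo = mid + 1      # first bad release is after mid
    return versions[lo]
```

In practice, `is_bad` would pip-install the given version and run a reproduction script against the model, checking the exit status.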

Environment

python -V
Python 3.10.12
uname -a
Linux Idan-PC 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Model: nous-hermes-13b.ggmlv3.q4_0.bin