LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0
4.66k stars · 334 forks

ggml_new_tensor_impl: not enough space in the scratch memory #50

Closed · Belarrius1 closed this issue 1 year ago

Belarrius1 commented 1 year ago

Every time I advance a little in my conversations, it crashes with the following error:

Processing Prompt [BLAS] (1024 / 1301 tokens)ggml_new_tensor_impl: not enough space in the scratch memory

My RAM is only 40% used, my max tokens setting is 2048, etc. I don't understand.

LostRuins commented 1 year ago

What model are you using? I previously increased the batch size for BLAS from 512 to 1024, but I will have to revert it if it's causing issues.
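
If your build already exposes the --blasbatchsize flag (treat that as an assumption for your version), you could also force the smaller batch explicitly without waiting for an update:

# Assumes --blasbatchsize is available in this koboldcpp build; it caps how
# many prompt tokens are processed per BLAS batch.
sudo python3 koboldcpp.py models/30B/ggml-model-q4_0.bin --port 666 --host 0.0.0.0 --stream --psutil_set_threads --blasbatchsize 512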

Belarrius1 commented 1 year ago

I use LLaMA 30B quantized to 4 bits (q4_0).

LostRuins commented 1 year ago

Can you please try the newest version, v1.6, which has a reduced batch size, and let me know if it's still crashing?

Belarrius1 commented 1 year ago

Okay, so it crashed again:

Processing Prompt (1 / 1 tokens) ./bot.sh: line 1: 597004 Segmentation fault sudo python3 koboldcpp.py models/30B/ggml-model-q4_0.bin --port 666 --host 0.0.0.0 --stream --psutil_set_threads belaserver@belaserver:/home/llama/koboldcpp$

Here the content of my "bot.sh"

sudo python3 koboldcpp.py models/30B/ggml-model-q4_0.bin --port 666 --host 0.0.0.0 --stream --psutil_set_threads
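
As an aside, sudo is only needed here because port 666 is below 1024, and binding privileged ports requires root. A sketch of the same launch on an unprivileged port (e.g. 5001, koboldcpp's usual default), which avoids running Python as root:

# Same launch without sudo: ports above 1023 don't require root to bind.
python3 koboldcpp.py models/30B/ggml-model-q4_0.bin --port 5001 --host 0.0.0.0 --stream --psutil_set_threads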

RAM Used -> 38% before crash (38% of 64GB)

I have 128 GB of swap in total (64 GB per NVMe drive) and 32 GB of zram, on Ubuntu Server 22.04 LTS. I use OpenBLAS (libopenblas0, version 0.3.20+ds-1).
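
To confirm whether system memory is actually the limit, you can watch RAM and swap from a second terminal while the prompt is being processed (standard Linux tools, nothing koboldcpp-specific):

# Refresh RAM/swap usage every second during prompt processing; if the crash
# happens with plenty of memory free, the limit is an internal scratch buffer,
# not system memory.
watch -n 1 free -h

# Optionally track the koboldcpp process itself (substitute the real PID):
watch -n 1 'ps -o pid,rss,vsz,cmd -p <PID>'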

LostRuins commented 1 year ago

Sounds like there are still some memory issues. So you can generate tokens until it eventually crashes after a certain length?

What if you run it with --nommap?

Belarrius1 commented 1 year ago

Yes, that's it. And sometimes after the crash, if I relaunch, there is a huge BLAS pass and the AI can respond with just one message; then it crashes again, as if it had reached a maximum.

I will try with --nommap.

Belarrius1 commented 1 year ago

After 26 answers, it crashed again.


Processing Prompt (1 / 1 tokens) ./bot.sh: line 1: 1106934 Segmentation fault sudo python3 koboldcpp.py models/30B/ggml-model-q4_0.bin --port 666 --host 0.0.0.0 --stream --psutil_set_threads --nommap

It does the BLAS pass and "Processing Prompt" for 18 tokens, then crashes again.

LostRuins commented 1 year ago

It does seem like you are somehow running out of memory; perhaps the scratch buffers are too small for 30B? Do you get the same issue on the main llama.cpp repo?
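
For a quick cross-check, a minimal sketch of reproducing on upstream llama.cpp (build option and flags as of that era; adjust paths to your setup):

# Build upstream llama.cpp with OpenBLAS, matching the koboldcpp setup.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_OPENBLAS=1

# Run the same 4-bit model with the same 2048-token context; -n caps generation.
./main -m /home/llama/koboldcpp/models/30B/ggml-model-q4_0.bin -c 2048 -n 64 -p "paste a long prompt here to exercise the BLAS path"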