LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

WizardCoder doesn't work #258

Closed · PedzacyKapec closed this 11 months ago

PedzacyKapec commented 1 year ago

When I enter, for instance, "Write a code in python to read btc price", I get an error in the main console:

Processing Prompt [BLAS] (49 / 49 tokens)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 1221854448, available 268435456)
exception: access violation writing 0x0000000000000000

My specs: 16 GB RAM, CPU: i7-10700, Windows 10

Model: WizardCoder-15B-1.0.ggmlv3.q4_0.bin
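The two numbers in the error are plain byte counts, so it is easy to see how far the request overshoots the scratch buffer. A quick sanity check (this helper is illustrative, not part of koboldcpp):

```python
# Byte counts copied from the error message above.
needed = 1_221_854_448    # bytes requested by ggml_new_tensor_impl
available = 268_435_456   # 256 * 1024 * 1024 bytes, the scratch buffer size

def to_mib(n: int) -> float:
    """Convert a byte count to MiB."""
    return n / (1024 * 1024)

print(f"needed:    {to_mib(needed):.0f} MiB")
print(f"available: {to_mib(available):.0f} MiB")
print(f"shortfall: {to_mib(needed - available):.0f} MiB")
```

So the allocation wants roughly 1165 MiB where only 256 MiB is available, a shortfall of about 909 MiB.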

richardr1126 commented 1 year ago

This is an out-of-RAM error for the context size. I had the same error on my Mac with 16 GB; it didn't show when I just typed "Hello". WizardCoder uses around 11 GB, and your OS (Windows) is probably taking up the rest of the memory. macOS probably uses less memory, which is why I could send "Hello".

It's annoying that this model sits right on the cutoff for a 16 GB machine, because that is what most people have. We need 2- or 3-bit k-quantization for WizardCoder!

TheMachineThinker commented 1 year ago

I'm having the same issue on Linux with koboldcpp v1.32 and LLAMA_OPENBLAS=1.

Processing Prompt [BLAS] (165 / 165 tokens)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 1253214816, available 268435456)

To fix this, I had to use the --noblas flag, which gives similar prompt-processing speed to koboldcpp v1.31.2, but generation speed is worse by about 127 ms/T (450 ms vs 323 ms). I was also able to prevent the error by modifying line 400 in gpt2_v3.cpp to increase the buffer size: static size_t buf_size = 1280u*1024*1024; and recompiling. That gives a slight improvement in processing speed (~15 ms/T), but generation speed is still the same as with --noblas, which is again about 127 ms/T slower than v1.31.2.
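As a sanity check on the quoted patch (not an endorsement of hard-coding buffer sizes), 1280 MiB does cover the allocation sizes reported so far in this thread, though note that the largest request reported later in the thread does not fit even the patched buffer:

```python
patched_buf = 1280 * 1024 * 1024   # static size_t buf_size = 1280u*1024*1024;
default_buf = 256 * 1024 * 1024    # 268435456, the "available" value in the errors

# "needed" values from the error messages quoted in this thread.
requests = [1_221_854_448, 1_253_214_816, 1_464_812_096]

for r in requests:
    print(r, "fits" if r <= patched_buf else "does not fit")
```

The first two requests fit; the third (1464812096 bytes) exceeds even 1280 MiB, so a fixed bump is at best a per-model workaround.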

Edit: ran some tests again after closing some applications. The generation speed seems similar, but the ms/T calculations are off in v1.31.2, and it seems --noblas gives me faster prompt processing in both versions. The numbers below are for WizardCoder-15B-1.0.ggmlv3.q5_1 on a 9th-gen Intel CPU with RAM bandwidth a little over 22 GB/s.

V1.31.2

Processing Prompt [BLAS] (142 / 142 tokens)
Generating (266 / 512 tokens)
Time Taken - Processing:34.6s (244ms/T), Generation:119.0s (232ms/T), Total:153.6s (3.3T/s)

V1.31.2 --noblas

Processing Prompt (142 / 142 tokens)
Generating (272 / 512 tokens)
Time Taken - Processing:31.1s (219ms/T), Generation:121.1s (236ms/T), Total:152.1s (3.4T/s)

V1.32 --noblas

Processing Prompt (142 / 142 tokens)
Generating (272 / 512 tokens)
Time Taken - Processing:31.1s (219ms/T), Generation:120.1s (441ms/T), Total:151.1s (1.8T/s)

h3ndrik commented 1 year ago

Also getting a segmentation fault and a not-enough-memory error with mpt 40b instruct (Linux, OpenBLAS).

Wingie commented 1 year ago

Processing Prompt [BLAS] (79 / 79 tokens)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 1464812096, available 268435456)

On an M2 Mac with 64 GB.

PedzacyKapec commented 1 year ago

Guys, it has absolutely nothing to do with memory. You can run the model in the oobabooga app no problem.

LostRuins commented 1 year ago

I'm quite sure ooba does not have any support for non-llama ggml, unless something has changed very recently.

Anyway, I am looking into this now.

LostRuins commented 1 year ago

Hi everyone, please try version 1.32.1 which should have this issue fixed.

Note that OpenBLAS takes more RAM, especially for large models. If you run out of RAM, try lowering --blasbatchsize to a smaller value like 128 or 64. Let me know if there are other issues.
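For example (the launcher name and model path here are illustrative; adapt them to how you run koboldcpp):

```shell
# Reduce the BLAS batch size to lower peak RAM use during prompt processing.
python koboldcpp.py --model WizardCoder-15B-1.0.ggmlv3.q4_0.bin --blasbatchsize 128
```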

LostRuins commented 1 year ago

@TheMachineThinker yes, the old generation ms/T calculations were wrong, as they also counted tokens that were not generated.

You can see from total time that it is about the same.
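The discrepancy can be reproduced from the v1.31.2 numbers in the benchmark above: dividing generation time by the full 512-token budget rather than the 266 tokens actually generated yields almost exactly the figure the old version printed (the exact formula is my assumption, but the arithmetic lines up):

```python
gen_time_s = 119.0   # v1.31.2 generation time from the log above
generated = 266      # tokens actually generated
budget = 512         # max tokens requested

old_ms_per_tok = gen_time_s * 1000 / budget       # what v1.31.2 printed
true_ms_per_tok = gen_time_s * 1000 / generated   # actual per-token cost

print(f"reported: {old_ms_per_tok:.0f} ms/T")   # ~232, matching the old log
print(f"actual:   {true_ms_per_tok:.0f} ms/T")  # ~447, close to v1.32's 441
```

This also explains why v1.32's 441 ms/T (120.1 s / 272 tokens) looks slower on paper while the totals are nearly identical.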

h3ndrik commented 1 year ago

MPT40b is working now. It dialed down the batch size from 512 to 256 on its own.

TheMachineThinker commented 1 year ago

@LostRuins It works as expected. Thank you for the quick turnaround.