ggerganov / ggml

Tensor library for machine learning
MIT License
10.83k stars 998 forks

starcoder -- not enough space in the context's memory pool #158

Closed bluecoconut closed 1 year ago

bluecoconut commented 1 year ago

I'm getting errors with starcoder models when I try to include any non-trivial number of tokens. I'm getting this with both my raw model (direct .bin) and quantized model regardless of version (pre Q4/Q5 changes and post Q4/Q5 changes).

Relevant error:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 412241472, available 411790368)

Example:

./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml.bin -p "def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test " --top_k 0 --top_p 0.95 --temp 0.2 

will cause the error

main: seed = 1684223471
starcoder_model_load: loading model from '/workspaces/research/models/starcoder/starcoder-ggml.bin'
starcoder_model_load: n_vocab = 49152
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 1
starcoder_model_load: qntvr   = 0
starcoder_model_load: ggml ctx size = 51276.47 MB
starcoder_model_load: memory size = 15360.00 MB, n_mem = 327680
starcoder_model_load: model size  = 35916.23 MB
main: prompt: 'def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test '
main: number of tokens in prompt = 51, first 8 tokens: 589 28176 97 26 28176 97 28176 28176 

def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibggml_new_tensor_impl: not enough space in the context's memory pool (needed 412241472, available 411952576)
Segmentation fault (core dumped)
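For scale, the allocation fails by only a few hundred kilobytes — the pool is almost, but not quite, large enough (numbers taken from the log above):

```python
# Shortfall implied by the error message in the log above.
needed = 412241472     # bytes requested by ggml_new_tensor_impl
available = 411952576  # bytes left in the context's memory pool

shortfall = needed - available
print(shortfall)       # → 288896 (~282 KiB short)
```

This near-miss is consistent with a fixed-size pool that was tuned close to, but just under, what longer prompts actually require.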

(Here's another output from the quantized model)

vscode ➜ /workspaces/research/others/ggml (master) $ ./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin -p "def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test " --top_k 0 --top_p 0.95 --temp 0.2 
main: seed = 1684223600
starcoder_model_load: loading model from '/workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49152
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 1003
starcoder_model_load: qntvr   = 1
starcoder_model_load: ggml ctx size = 28956.47 MB
starcoder_model_load: memory size = 15360.00 MB, n_mem = 327680
starcoder_model_load: model size  = 13596.23 MB
main: prompt: 'def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test '
main: number of tokens in prompt = 51, first 8 tokens: 589 28176 97 26 28176 97 28176 28176 

def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibggml_new_tensor_impl: not enough space in the context's memory pool (needed 412241472, available 411790368)
Segmentation fault (core dumped)

Best I can find in the past was https://github.com/ggerganov/llama.cpp/issues/29

Maybe that was fixed for the llama models, but the problem has returned for starcoder?

Based on: https://github.com/ggerganov/ggml/pull/146

Specifically hoping that @NouamaneTazi might have some clarity on why this might be happening?

NouamaneTazi commented 1 year ago

Interesting find! Thank you for raising this. Two questions:

bluecoconut commented 1 year ago

Just tried santacoder and it does seem to have the same problem, but at a very different scale (the error is the same). I had to put in more than 700 tokens, maybe around 1000 or so, before it triggered... so this might just be a normal context-length issue?

Example code I used to test santacoder. Note that this doesn't run the ggml executable directly but goes through ctransformers; however, the same errors show up as in the original post, where I used the compiled ./starcoder directly, so I think it's safe to say it would behave the same on the underlying ggml.

Python 3.10.11 (main, Apr 12 2023, 14:46:22) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import lambdaprompt as lp
>>> import os
>>> os.environ['LAMBDAPROMPT_BACKEND'] = 'SantaCoderGGML'
>>> comp = lp.Completion("# Some code to print fibonacci numbers\n"*100, max_new_tokens=100)
>>> comp()
Fetching 0 files: 0it [00:00, ?it/s]
Fetching 1 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25575.02it/s]
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268617232, available 268435456)
Segmentation fault (core dumped)

(I did one other test with "# Some code to print fibonacci numbers\n"*60, and that one ran successfully on santacoder.)

>>> len(lp.backends.backends['completion'].model.tokenize("# Some code to print fibonacci numbers\n"*60))
720
>>> len(lp.backends.backends['completion'].model.tokenize("# Some code to print fibonacci numbers\n"*100))
1200
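The "available" figure in the santacoder error above is exactly the example's default 256 MiB pool, and the ~1200-token prompt overshoots it only slightly:

```python
# The santacoder pool size in the log is exactly 256 MiB.
available = 268435456
print(available == 256 * 1024 * 1024)  # → True

# At ~1200 prompt tokens the request just overflows that pool.
needed = 268617232
print(needed - available)              # → 181776 (bytes over the pool)
```

This matches the observation that 720 tokens fit but 1200 do not: memory use grows with token count until it crosses the fixed 256 MiB line.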

I'll try out the starcoder.cpp and raw ggml with santacoder later / when I'm back at my machine.

bluecoconut commented 1 year ago

https://github.com/bigcode-project/starcoder.cpp/issues/3

Seems someone else has run into this on the starcoder.cpp

ggerganov commented 1 year ago

I tried looking into this but the python script from the example fails to download the model on Mac OS:

$ python3 examples/starcoder/convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder
Loading model:  bigcode/gpt_bigcode-santacoder
Traceback (most recent call last):
  File "/Users/ggerganov/development/github/ggml/examples/starcoder/convert-hf-to-ggml.py", line 56, in <module>
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 766, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 473, in __getitem__
    raise KeyError(key)
KeyError: 'gpt_bigcode'

Any ideas how to fix this?

NouamaneTazi commented 1 year ago

@ggerganov I think you're on an old version of transformers. Try updating it: pip install -U transformers

NouamaneTazi commented 1 year ago

@ggerganov I've been trying to increase context's memory pool by modifying this part of the code

        ctx_size += 10 * 1024 * 1024; // TODO: tune this

        printf("%s: ggml ctx size = %6.2f MB\n", __func__, ctx_size/(1024.0*1024.0));

but it doesn't seem to affect ctx->mem_size, because the error message is always the same: ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268637760, available 268435456) (ctx->mem_size = 268435456, where it should be more)

Any idea how to increase ctx->mem_size? Relevant PR

ggerganov commented 1 year ago

The problem is in the "eval" context:

https://github.com/ggerganov/ggml/blob/c2fab8a3503b6e6fbf480be993f24c21951d3af0/examples/starcoder/main.cpp#L415-L431

Currently, it starts with a 256 MB buffer and is increased based on N. But this does not take into account n_past and in general is a very memory wasteful approach since the entire compute graph results are stored in this buffer.
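The sizing problem described above can be illustrated with a hedged Python sketch (simplified from the eval-buffer logic in examples/starcoder/main.cpp; the names and the numbers below are illustrative, not measured values):

```python
# Toy model of the eval-buffer sizing: the pool is grown based on the
# batch size N only, as the example code does, ignoring n_past.
buf_size = 256 * 1024 * 1024  # initial 256 MiB pool


def ensure_buf(mem_per_token: int, n_batch: int) -> int:
    """Grow the pool only when mem_per_token * n_batch exceeds it."""
    global buf_size
    if mem_per_token > 0 and mem_per_token * n_batch > buf_size:
        buf_size = int(1.1 * mem_per_token * n_batch)  # ~10% headroom
    return buf_size


# Illustrative numbers: a short batch appended to a long accumulated
# context. The check never fires, so the pool stays at 256 MiB...
mem_per_token, N, n_past = 400_000, 51, 700
pool = ensure_buf(mem_per_token, N)

# ...but the real requirement scales with the full context (n_past + N),
# so it can exceed the pool exactly as the error messages show.
true_need = mem_per_token * (n_past + N)
print(pool, true_need, pool >= true_need)
```

Under these toy numbers the pool stays at 268435456 bytes while the context-dependent requirement grows past it, reproducing the "needed > available" failure mode.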

Here I tried to improve this using scratch buffers: https://github.com/ggerganov/ggml/pull/176

Please give it a try and let me know if your tests still crash with this version.
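As a rough illustration of why scratch buffers help (toy numbers, not ggml's actual accounting): instead of keeping every intermediate tensor alive in one growing pool, layer intermediates can alternate between two fixed scratch buffers, bounding peak usage regardless of graph depth.

```python
# Illustrative per-op intermediate sizes in MiB (made-up numbers).
intermediate_sizes = [50, 70, 60, 80, 55]

# One pool that keeps every intermediate alive:
single_pool_peak = sum(intermediate_sizes)

# Two scratch buffers: each op writes into one of two buffers in
# alternation, so each buffer only needs to hold its largest tensor.
scratch = [0, 0]
for i, size in enumerate(intermediate_sizes):
    scratch[i % 2] = max(scratch[i % 2], size)
two_scratch_peak = sum(scratch)

print(single_pool_peak, two_scratch_peak)  # → 315 140
```

The alternation works because each op typically only needs its immediate inputs, so older intermediates can be overwritten instead of accumulating in the pool.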

vmajor commented 1 year ago

I am observing a similar issue with the python wrapper llama-cpp-python: https://github.com/abetlen/llama-cpp-python/issues/356

eshaanagarwal commented 1 year ago

Hi, I was trying the GPT4All 1.3 groovy model and I faced the same issue. I am not able to understand why this is happening. Can anybody provide me with a solution for it?

vmajor commented 1 year ago

@eshaanagarwal the only "solution" that I found was a reboot. Since rebooting is not an option I had to switch to different models. For me all 30B/33B LLM models eventually develop this error when the input context is reaching the upper limit. This does not affect the 65B models. I do not know about any other relationships as this is my use case.

eshaanagarwal commented 1 year ago

> @eshaanagarwal the only "solution" that I found was a reboot. Since rebooting is not an option I had to switch to different models. For me all 30B/33B LLM models eventually develop this error when the input context is reaching the upper limit. This does not affect the 65B models. I do not know about any other relationships as this is my use case.

@ggerganov can the memory leak or the issue be fixed ? Or any possible direction as to how to fix it ? Because I really need for this model to work

ggerganov commented 1 year ago

@eshaanagarwal If you are using the latest version of the starcoder example the issue should not occur. It was fixed in https://github.com/ggerganov/ggml/pull/176

If the issue occurs, please provide more details about the model you are using, your system information, and the parameters with which you trigger the error.