ggerganov / ggml

GPTNeoX model with 16k context results in context-size related issues: `ggml_new_tensor_impl: not enough space in the context's memory pool`, and instant core dump with fp16 #225

Open TheBloke opened 1 year ago

TheBloke commented 1 year ago

Hey guys

Today I was doing quants of a new GPTNeoX model called Literature-7B-16384.

I tried making GGMLs through the usual process:

python examples/gpt-neox/convert-h5-to-ggml.py /workspace/models/hakurei_Literature-7B-16384 0
build/bin/gpt-neox-quantize /workspace/process/literature-7b/ggml/ggml-model-f32.bin /workspace/process/literature-7b/ggml/literature-7b-16384.gptneox.ggmlv3.q4_0.bin q4_0

Both steps completed fine. But the models can't be used.

Trying to use the fp32:

[pytorch2] ubuntu@h100:/workspace/git/ggml git:(master) $ build/bin/gpt-neox -m /workspace/process/literature-7b/ggml/ggml-model-f32.bin   -p "test"
main: seed = 1685827980
gpt_neox_model_load: loading model from '/workspace/process/literature-7b/ggml/ggml-model-f32.bin' - please wait ...
gpt_neox_model_load: n_vocab = 50432
gpt_neox_model_load: n_ctx   = 16384
gpt_neox_model_load: n_embd  = 4096
gpt_neox_model_load: n_head  = 32
gpt_neox_model_load: n_layer = 32
gpt_neox_model_load: n_rot   = 128
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype   = 0
gpt_neox_model_load: qntvr   = 0
gpt_neox_model_load: ggml ctx size = 11822.28 MB
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 12661385472, available 12396563456)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 12661451264, available 12396563456)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 12594392320, available 12396563456)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 12460224000, available 12396563456)
.... lots of similar lines removed ...
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16691523072, available 12396563456)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16691523072, available 12396563456)
[1]    1178727 segmentation fault (core dumped)  build/bin/gpt-neox -m /workspace/process/literature-7b/ggml/ggml-model-f32.bi
[pytorch2] ubuntu@h100:/workspace/git/ggml git:(master) $
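
For context, ggml hands every tensor out of a single fixed-size memory pool that is allocated once up front; these errors mean the loader's size estimate undershot what the tensors actually need. A minimal sketch of that allocation pattern against the public ggml.h API (with a deliberately undersized pool to provoke the same message; not from the thread):

    #include <stdio.h>
    #include "ggml.h"

    int main(void) {
        // Deliberately tiny pool; the real loader computes mem_size
        // from the model hyperparameters.
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16 * 1024 * 1024,
            /*.mem_buffer =*/ NULL, // let ggml allocate the pool itself
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        // Every ggml_new_tensor_* call carves space out of the pool.
        // Once the pool runs dry, ggml prints "not enough space in the
        // context's memory pool" and (in release builds) returns NULL;
        // a caller that dereferences the result then segfaults, as in
        // the f32 run above.
        struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8 * 1024 * 1024);
        printf("tensor: %p\n", (void *) t);

        ggml_free(ctx);
        return 0;
    }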

Trying an fp16 conversion instead is even more spectacular:

[pytorch2] ubuntu@h100:/workspace/git/ggml git:(master) $ build/bin/gpt-neox -m /workspace/process/literature-7b/ggml/ggml-model-f16.bin -n 100  -p "test"
main: seed = 1685827752
gpt_neox_model_load: loading model from '/workspace/process/literature-7b/ggml/ggml-model-f16.bin' - please wait ...
gpt_neox_model_load: n_vocab = 50432
gpt_neox_model_load: n_ctx   = 16384
gpt_neox_model_load: n_embd  = 4096
gpt_neox_model_load: n_head  = 32
gpt_neox_model_load: n_layer = 32
gpt_neox_model_load: n_rot   = 128
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype   = 1
gpt_neox_model_load: qntvr   = 0
gpt_neox_model_load: ggml ctx size = 17592186043162.29 MB
GGML_ASSERT: /workspace/git/ggml/src/ggml.c:3982: ctx->mem_buffer != NULL
[1]    1178038 abort (core dumped)  build/bin/gpt-neox -m /workspace/process/literature-7b/ggml/ggml-model-f16.bi
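
The fp16 path dies earlier because the garbage size estimate (about 2^64 bytes, i.e. roughly 16 EiB) is handed straight to the pool allocation, which fails. A simplified sketch of that failure mode, not the actual ggml source:

    #include <assert.h>
    #include <stdlib.h>

    int main(void) {
        // The bogus "17592186043162.29 MB" estimate is close to 2^64 bytes.
        size_t mem_size = (size_t) 17592186043162ULL * 1024 * 1024;

        // An allocation this large fails on any real machine...
        void * mem_buffer = malloc(mem_size);

        // ...which is what the GGML_ASSERT at ggml.c:3982
        // (ctx->mem_buffer != NULL) catches, aborting the process.
        // Plain assert() stands in for GGML_ASSERT here.
        assert(mem_buffer != NULL);

        free(mem_buffer);
        return 0;
    }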

And then trying a quantised version made from either fp32 or fp16 gives the same errors as with the fp32:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 116195584, available 265216)

I tried various -n values with both files but that made no difference.

I assume it's because some support needs to be added for the unusually large context size? I have previously tested GPTNeoX models with 4k and 8k context and those seemed to work.

I don't know if this is a bug or a feature request, but I thought I'd let you guys know. Let me know if you'd like me to upload the fp16, fp32 or q4_0 GGMLs anywhere for inspection.

Thanks in advance!

klosax commented 1 year ago

gpt_neox_model_load: ggml ctx size = 17592186043162.29 MB

It seems to be an overflow in the context-size calculation: the arithmetic is done in signed 32-bit int, which wraps at this context length.

Change int to size_t in these lines:

        const int n_embd  = hparams.n_embd;
        const int n_layer = hparams.n_layer;
        const int n_ctx   = hparams.n_ctx;
        const int n_vocab = hparams.n_vocab;

to

        const size_t n_embd  = hparams.n_embd;
        const size_t n_layer = hparams.n_layer;
        const size_t n_ctx   = hparams.n_ctx;
        const size_t n_vocab = hparams.n_vocab;
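
To see why this fixes it: n_layer * n_ctx * n_embd = 32 * 16384 * 4096 = 2^31, one past INT_MAX, so the size arithmetic overflows while still in 32-bit int, before it is ever widened. A standalone demonstration, with the hyperparameter values taken from the log above:

    #include <stdio.h>

    int main(void) {
        const int n_layer = 32, n_ctx = 16384, n_embd = 4096;

        // Evaluated entirely in 32-bit int arithmetic; 2^31 does not
        // fit, so the product overflows (undefined behaviour; in
        // practice it wraps to -2147483648).
        long long bad = n_layer * n_ctx * n_embd;

        // With a size_t operand the product is computed in 64 bits
        // and comes out as the intended 2147483648.
        size_t good = (size_t) n_layer * n_ctx * n_embd;

        printf("int:     %lld\n", bad);
        printf("size_t:  %zu\n", good);

        // Converting the wrapped negative value back to size_t gives
        // a number on the order of 2^64 bytes -- the absurd context
        // size in the fp16 log.
        printf("wrapped: %zu\n", (size_t) bad);
        return 0;
    }

Declaring the hparams as size_t (or casting the first operand of each product) keeps the whole chain in 64-bit arithmetic, since the remaining int operands are promoted.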

With that change, a q8_0 quantization works:

./main -m litterature-7b-q8_0.bin 
main: seed = 1685837187
gpt_neox_model_load: loading model from 'litterature-7b-q8_0.bin' - please wait ...
gpt_neox_model_load: n_vocab = 50432
gpt_neox_model_load: n_ctx   = 16384
gpt_neox_model_load: n_embd  = 4096
gpt_neox_model_load: n_head  = 32
gpt_neox_model_load: n_layer = 32
gpt_neox_model_load: n_rot   = 128
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype   = 2007
gpt_neox_model_load: qntvr   = 2
gpt_neox_model_load: ggml ctx size = 25384.91 MB
gpt_neox_model_load: memory_size =  8192.00 MB, n_mem = 524288
gpt_neox_model_load: ................................................ done
gpt_neox_model_load: model size =  6953.16 MB / num tensors = 388
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 1
main: token[0] =   3726, They

They-the-heavens! I've been sitting here, and he's never come back!^C
TheBloke commented 1 year ago

Thanks so much! I will test and close this shortly.