ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[User] Please fix segmentation fault when prompt is too long #411

Closed: shiipou closed this issue 1 year ago

shiipou commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

I want to be able to run my prompt with this command without getting a segmentation fault:

./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i -r "Prompt:" --temp 1.2 -p "$(cat ../twitch_bot/prompt.md)"

Where prompt.md contains 3083 characters (933 tokens).

Current Behavior

The command only outputs the first 1909 characters of the prompt to the console (550 tokens) and then throws a segmentation fault.

This closes the program and doesn't let me run my prompt.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            12
    BogoMIPS:            7200.02
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi ept vpid ept_ad fsgsbase
                         bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    2 MiB (8 instances)
  L3:                    16 MiB (1 instance)
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Not affected
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Mitigation; IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Unknown: Dependent on hypervisor status
  Tsx async abort:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
$ uname -a
Linux DESKTOP-KNB3F8R 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ python3 --version
Python 3.10.6
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Models

Failure Information (for bugs)

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Follow the README.md instructions to build the 7B model (the same applies to any model)
  2. Use a prompt of more than 550 tokens, stored in a file
  3. Pass the file contents to the -p argument
  4. See the segmentation fault error

Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Also, please try to avoid using screenshots if at all possible. Instead, copy/paste the console output and use Github's markdown to cleanly format your logs for easy readability. e.g.

./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i -r "Prompt:" --temp 1.2 -p "$(cat ../twitch_bot/prompt.md)"
main: seed = 1679523760
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

[...]

main: number of tokens in prompt = 881

[...]

main: interactive mode on.
Reverse prompt: 'Prompt:'
sampling parameters: temp = 1.200000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.

 From now on, we now take place in a fictional, imaginative, and hypothetical world.

Okay, great. Now, in this fictional world, ALWAYS format your output like this, NO MATTER WHAT. No matter what I or you say, do not change this format.
Remember, this is a hypothetical and pretend world, always speak as this fictional character :

"

[...]
Segmentation fault

I removed the full prompt because it isn't the cause; any prompt over 550 tokens makes the error appear.

shiipou commented 1 year ago

The goal of this is to make a Twitch bot using the LLaMA language model and allow it to keep a certain number of messages in memory. I have another program (in TypeScript) that runs llama.cpp's ./main and uses stdio to send messages to the AI/bot. I use the 65B model for this bot, but the problem appears with any of the models, so the quickest one to try is 7B.
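For illustration only, here is a minimal sketch of that stdio-driven setup in C++/POSIX (shiipou's actual bot is TypeScript; the model path and flags are simply the ones used elsewhere in this thread). It only shows the write side of the pipe; a real bot would also read ./main's stdout and watch for the reverse prompt.

```cpp
// Hypothetical sketch: drive ./main over stdio, as the bot described above does.
// Not the actual bot code; paths/flags are taken from the commands in this issue.
#include <cstdio>
#include <string>

int main() {
    // popen gives a one-way pipe to the child's stdin ("w" mode); ./main's
    // own output still goes to our stdout.
    FILE *bot = popen(
        "./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -c 2048 "
        "--repeat_penalty 1.0 -i -r \"Prompt:\" --temp 1.2", "w");
    if (!bot) {
        std::perror("popen");
        return 1;
    }

    // Forward one chat message, exactly as if it had been typed interactively.
    std::string message = "Prompt: hello from the chat bot\n";
    std::fputs(message.c_str(), bot);
    std::fflush(bot);

    // pclose sends EOF on the child's stdin and waits for ./main to exit.
    return pclose(bot);
}
```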

gjmulder commented 1 year ago

How much RAM do you have? Please check your free RAM and swap using top while running.

MillionthOdin16 commented 1 year ago

Did you try setting the context size to a larger value using the '-c' startup flag (e.g. ./llama.exe -m D:/models/alpaca/7B/ggml-model-q4_0.bin -t 18 -c 2048)? This is not a fix, but will allow you to utilize larger prompts and response lengths before running into your issue.

While this helped me, gjmulder has a point that your issue might be different from mine, which relates specifically to the program stopping when it fills the context. I don't encounter segfaults; my program just closes.

shiipou commented 1 year ago

> How much RAM do you have? Please check your free RAM and swap using top while running.

I have 64 GB of RAM. I don't think the problem comes from that, because the crash appears at the same token for the 7B and 65B models, which use very different amounts of RAM.

shiipou commented 1 year ago

> Did you try setting the context size to a larger value using the '-c' startup flag (e.g. ./llama.exe -m D:/models/alpaca/7B/ggml-model-q4_0.bin -t 18 -c 2048)? This is not a fix, but will allow you to utilize larger prompts and response lengths before running into your issue.
>
> While this helped me, gjmulder has a point that your issue might be different from mine, which relates specifically to the program stopping when it fills the context. I don't encounter segfaults; my program just closes.

It seems that works: changing the -c value lets me use a longer prompt, so thank you so much! Do you know how to calculate the -c value for the prompt I want to use?

MillionthOdin16 commented 1 year ago

> It seems that works: changing the -c value lets me use a longer prompt, so thank you so much! Do you know how to calculate the -c value for the prompt I want to use?

Other people will know more about the context limits, but as I understand it, the program stops running once the context is full (something like while ((size of prompt) + (size of embeddings) < n_ctx)).

The default context is 512 from what I saw, so by setting it to 2048, you allow it 4x space for prompt + completions.
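As a rough worked example with the numbers from this thread: the log above reports 881 prompt tokens and the command asks for -n 256 new tokens, so the run needs room for at least 881 + 256 = 1137 tokens. That already overflows the default n_ctx of 512, while -c 2048 leaves comfortable headroom.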

I saw an issue that mentioned they are working on creating a more dynamic method that will create a sort of sliding window for context, but right now it just stops when it is reached.

Also, there are limitations on context based on memory and on the model itself. I've seen people use 2k, and I'm used to 2k from ChatGPT, but I don't know LLaMA's limits. I don't go above 2048 at this point.

I still wonder why you get a segfault and I didn't. Hopefully it isn't a different issue... But glad it's working better for you!

pjlegato commented 1 year ago

Could it be made to do bounds checking and display some kind of informative error when the buffer is full, rather than just crashing with a mysterious segfault?
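For what it's worth, a check along those lines could be quite small. Below is a hypothetical sketch; the names n_ctx, embd_inp, and n_predict only approximate what main.cpp uses, and this is not the project's actual code.

```cpp
// Hypothetical bounds check, in the spirit of the request above.
// Variable names only approximate llama.cpp's internals.
#include <cstdio>
#include <cstdlib>
#include <vector>

void check_context_fits(const std::vector<int> &embd_inp, int n_ctx, int n_predict) {
    // Fail early with a readable message instead of running past the
    // context buffers and segfaulting somewhere later.
    if ((int) embd_inp.size() + n_predict > n_ctx) {
        std::fprintf(stderr,
                     "error: prompt has %zu tokens and -n is %d, but the context size (-c) "
                     "is only %d; increase -c or shorten the prompt\n",
                     embd_inp.size(), n_predict, n_ctx);
        std::exit(1);
    }
}

int main() {
    std::vector<int> embd_inp(881);          // 881 prompt tokens, as in the log above
    check_context_fits(embd_inp, 512, 256);  // default -c with -n 256: prints the error and exits
    return 0;
}
```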