abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Segmentation fault while generating. #218

Closed Firstbober closed 1 year ago

Firstbober commented 1 year ago

Expected Behavior

Continue the generation and gracefully exit.

Current Behavior

Segmentation fault while generating tokens. It usually happens after generating ~121 tokens (I ran 4 different prompts, which crashed at tokens 122, 121, 118 and 124), and it doesn't seem to happen with the llama.cpp ./main example.

Environment and Context

I am using a context size of 512, a prediction length of 256 and a batch size of 1024; the rest of the settings are at their defaults. I am also using CLBlast, which gives me a 2.5x boost in performance with llama.cpp, and a libllama.so built from the latest llama.cpp source so I can debug it with gdb.
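
For concreteness, those settings would look roughly like this with the high-level Llama wrapper (a sketch only: the model path is hypothetical and the parameter names n_ctx, n_batch and max_tokens assume llama-cpp-python's current API; my actual script drives the low-level llama_cpp bindings):

```
# Sketch only: the settings described above, expressed via the high-level API.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/wizardlm-7b.ggml.q4_0.bin",  # hypothetical path
    n_ctx=512,     # context size
    n_batch=1024,  # batch size for prompt evaluation
)

# "prediction 256" corresponds to capping generation at 256 new tokens.
out = llm("Write me a long essay about cookies.", max_tokens=256)
print(out["choices"][0]["text"])
```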

Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 3600 6-Core Processor
    CPU family:          23
    Model:               113
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  94%
    CPU max MHz:         4208,2031
    CPU min MHz:         2200,0000
    BogoMIPS:            7186,94
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
                         a cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall n
                         x mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_go
                         od nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl p
                         ni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe
                          popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy sv
                         m extapic cr8_legacy abm sse4a misalignsse 3dnowprefetc
                         h osvw ibs skinit wdt tce topoext perfctr_core perfctr_
                         nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate
                          ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bm
                         i2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsa
                         veopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_tota
                         l cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd
                          arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean 
                         flushbyasid decodeassists pausefilter pfthreshold avic 
                         v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_re
                         cov succor smca sev sev_es
Number of devices                                 1
  Device Name                                     gfx803
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2 
  Driver Version                                  3513.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         AMD Radeon RX 580 Series

Linux bober-desktop 6.3.1-x64v1-xanmod1-2 #1 SMP PREEMPT_DYNAMIC Sun, 07 May 2023 10:32:57 +0000 x86_64 GNU/Linux

Python 3.11.3

GNU Make 4.4.1
Built for x86_64-pc-linux-gnu

g++ (GCC) 13.1.1 20230429

Failure Information (for bugs)

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff6c35769 in ggml_element_size (tensor=0x7ffea0fff130) at ggml.c:3666
3666        return GGML_TYPE_SIZE[tensor->type];

Steps to Reproduce

  1. Add my class from https://gist.github.com/Firstbober/d7f97e7f743a973c14425424e360eeda to the source code
  2. Create an instance with WizardLM-7b
  3. Use llamaChat.load_context with some lengthy prompt (mine has 1300 characters)
  4. Try to generate something with llamaChat.generate; I used this piece of code:

     tokens = ""
     i = 0
     for token in llamaChat.generate('[[YOU]]: Write me a long essay about cookies, as long as you can.\n'):
         print(token, i)
         tokens += token
         i += 1

     print(tokens)

  5. Watch as Python crumbles at around token 121.

gjmulder commented 1 year ago

I suspect you're hitting some internal memory buffer limit in ggml.c, or maybe in CLBlast. Can you watch the memory utilisation on your GPU while it is running? For CUDA I would use nvidia-smi.

Given that you only have 4GB of VRAM, are you setting n_gpu_layers? If so, try reducing it and see whether that changes when the problem occurs. For reference, Vicuna 13B with 40 CuBLAS layers on my NVidia GPU uses 11GB of VRAM.

If you have a single stand-alone Python script that generates the error, I can try to reproduce it with my NVidia GPU. If I can't repro it, that points to CLBlast as part of the issue.

Finally, a stupid question, but did you use the exact same params and prompt length with ./main?

Firstbober commented 1 year ago

> I suspect you're hitting some internal memory buffer limit in ggml.c, or maybe in CLBlast. Can you watch the memory utilisation on your GPU while it is running? For CUDA I would use nvidia-smi.

Using radeontop, I saw nothing out of the ordinary. Throughout the entire runtime of the script, the VRAM utilization stayed at a comfortable ~1830M.

> Given that you only have 4GB of VRAM, are you setting n_gpu_layers? If so, try reducing it and see whether that changes when the problem occurs. For reference, Vicuna 13B with 40 CuBLAS layers on my NVidia GPU uses 11GB of VRAM.

I specified n_gpu_layers 32 for ./main; in my Python script I just use the defaults. AFAIK the 7B model has 31 layers, which easily fits into my VRAM: while chatting for a while with the ./main example I sit at around 2100M with more than 500 tokens already generated.
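
For what it's worth, offloading can also be requested from the Python side. A minimal sketch of the low-level route my class takes, assuming the bindings expose an n_gpu_layers field on the context params (the model path here is hypothetical):

```
import llama_cpp

# Sketch only: mirrors passing -ngl 32 to ./main. The n_gpu_layers field on
# llama_context_params is assumed to be present in the current bindings;
# leaving the defaults means no layers are offloaded at all.
params = llama_cpp.llama_context_default_params()
params.n_ctx = 512
params.n_gpu_layers = 32
ctx = llama_cpp.llama_init_from_file(b"./models/wizardlm-7b.ggml.q4_0.bin", params)
```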

> If you have a single stand-alone Python script that generates the error, I can try to reproduce it with my NVidia GPU. If I can't repro it, that points to CLBlast as part of the issue.

Here is the script: https://gist.github.com/Firstbober/a08de9cf01ea90b6be8389be9a249857. I changed the prompt a few times, and in some cases the error doesn't appear. Maybe there is something in it that makes the library uncomfortable? The prompt I attached in the script is the one that segfaults; the DAN one from the llama.cpp repo seems to work fine.

> Finally, a stupid question, but did you use the exact same params and prompt length with ./main?

Yes

gjmulder commented 1 year ago

I modified your script to take the model path from sys.argv[1], and I also noticed it isn't offloading any layers to the GPU or putting any load on the GPU.

```
$ LLAMA_CUBLAS=1 pip install --force-reinstall --ignore-installed --no-cache-dir llama-cpp-python --verbose
$ export N_GPU_LAYERS=1000
$ python ./test.py /data/llama/7B/ggml-model-q4_1.bin
[LLaMAChat] Init
[LLaMAChat] Loading model from file
llama.cpp: loading model from /data/llama/7B/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 2052.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
llama_init_from_file: kv self size  = 512.00 MB
[LLaMAChat] Tokenizing context prompt
[LLaMAChat] Making first evaluation of input tokens
[LLaMAChat] Generating completion based on '[[YOU]]: Write me a long essay about cookies, as long as you can.
'
[[ 0
Y 1
OU 2
]] 3
: 4
I 5
' 6
m 7
etc.
OU 186
]] 187
: 188
I 189
^CTraceback (most recent call last):
  File "/home/mulderg/Work/./test.py", line 165, in <module>
    for token in llamaChat.generate('[[YOU]]: Write me a long essay about cookies, as long as you can.\n'):
  File "/home/mulderg/Work/./test.py", line 102, in generate
    llama_cpp.llama_eval(
  File "/home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/llama_cpp/llama_cpp.py", line 336, in llama_eval
    return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
KeyboardInterrupt
```

I suspect the CLBlast support is very new and somewhat unstable, given that most devs (including @ggerganov) are using NVidia GPUs, sorry.

Firstbober commented 1 year ago

Well, I compiled libllama.so without CLBlast support and the segmentation fault still persists.

Firstbober commented 1 year ago

I pinned the problem down to the n_past argument of llama_eval, so the next few hours will be spent figuring out how to stop llama from repeating itself after reaching the max context… Yay. I saw something about context swapping in the ./main example, so it will probably be useful; a rough sketch of my reading of it is below.
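
For anyone landing here later, this is roughly what the ./main context swap seems to do, transcribed to Python (my reading of examples/main, not a tested port; the function and variable names are mine, mirroring llama.cpp's):

```
# Rough sketch of llama.cpp's examples/main "context swapping", as I read it.
def maybe_swap_context(n_past, n_keep, n_ctx, history, pending):
    """history: tokens already evaluated; pending: tokens queued for llama_eval."""
    if n_past + len(pending) > n_ctx:
        n_left = n_past - n_keep
        # Keep the first n_keep prompt tokens, discard the oldest half of the
        # rest, and re-evaluate the newest half so generation can continue.
        n_past = n_keep
        pending = history[-(n_left // 2):] + pending
    return n_past, pending
```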

grantslatton commented 1 year ago

Just spent a few hours debugging a related issue.

The n_parts argument was removed in a recent version of llama.cpp, so if you are compiling from source at a newer commit, you will hit this issue.

This llama.cpp commit removes the n_parts parameter: https://github.com/ggerganov/llama.cpp/commit/dc271c52ed65e7c8dfcbaaf84dabb1f788e4f3d0

So this code in llama-cpp-python is now invalid when paired with llama.cpp mainline: https://github.com/abetlen/llama-cpp-python/blob/01a010be521c076f851789ad56bec82284fdf96e/llama_cpp/llama_cpp.py#L116

Deleting this line fixes the issue.

For me, it manifested as GGML_ASSERT: ggml.c:5702: ggml_is_contiguous(a), but I think it could manifest in many ways, since it is basically memory corruption: the C side reads the params struct's bytes with a different layout than the stale binding wrote them with.
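
To make that concrete, here is a self-contained toy (the field layouts are invented, not the real llama_context_params) showing how one stale extra field in a ctypes Structure shifts every later field once the C side stops expecting it:

```
import ctypes

class ParamsPython(ctypes.Structure):       # stale binding: still has n_parts
    _fields_ = [("n_ctx",   ctypes.c_int),
                ("n_parts", ctypes.c_int),  # removed on the C side
                ("seed",    ctypes.c_int)]

class ParamsC(ctypes.Structure):            # layout the C library now expects
    _fields_ = [("n_ctx", ctypes.c_int),
                ("seed",  ctypes.c_int)]

py_view = ParamsPython(n_ctx=512, n_parts=-1, seed=1337)
# Reinterpret the same bytes the way the C side would read them:
c_view = ParamsC.from_buffer_copy(bytes(py_view)[:ctypes.sizeof(ParamsC)])
print(c_view.n_ctx, c_view.seed)  # prints "512 -1": seed now reads n_parts' bytes
```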

AlphaAtlas commented 1 year ago

As a side note, I don't think token generation is actually accelerated with CLBlast yet? It's behind a PR in the llama.cpp repo, and my observation is that the GPU shows no load no matter what I set n_gpu_layers to... but maybe something was wrong with my quick CLBlast test.

Point being that maybe this has nothing to do with the GPU.

TheTerrasque commented 1 year ago

https://github.com/ggerganov/llama.cpp/pull/1459 adds fairly good OpenCL support and was merged 5 days ago. The readme also now says: "The CLBlast build supports --gpu-layers|-ngl like the CUDA version does."

I've tested the Windows CLBlast builds, and they work pretty well on my 3080: 250ms per token with some offloading, and 450ms without. That said, I can't get it to work with llama-cpp-python; it seems to ignore GPU layers with CLBlast.

gjmulder commented 1 year ago

I confirmed that the latest llama-cpp-python should have picked up the CLBlast support:

/vendor/llama.cpp$ git log | head -3
commit 66874d4fbcc7866377246efbcee938e8cc9c7d76
Author: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Date:   Thu May 25 20:18:01 2023 -0600

gjmulder commented 1 year ago

Closing. Please update to the latest llama-cpp-python, which should include better CLBlast support.