ggerganov / llama.cpp

LLM inference in C/C++

CLBlast fails on context lengths above 2048 after merging #4256 #4296

Closed: LostRuins closed this issue 11 months ago

LostRuins commented 11 months ago

After the commit that merged https://github.com/ggerganov/llama.cpp/pull/4256, inference with CLBlast fails with a segfault at context sizes above 2k when all GPU layers are offloaded.

Command line: C:\test\llama-b1601-bin-win-clblast-x64>main.exe -m E:\LLaMA\models\airoboros-mistral2.2-7b.Q4_K_S.gguf -c 4096 -b 512 -n 32 -ngl 33 -f C:\test\test.txt

main: build = 1601 (5a7d312)
main: built with MSVC 19.37.32826.1 for x64
main: seed  = 1701534899
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 2060'
ggml_opencl: device FP16 support: false

Result: prompt processing starts, then segfaults around the 2k-token mark, before generation begins. It only appears to work if the prompt is short enough (fewer than 2k tokens).

ggerganov commented 11 months ago

Does it work with this patch:

diff --git a/llama.cpp b/llama.cpp
index fd905ade..69c45c3f 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -3813,7 +3813,7 @@ static struct ggml_tensor * llm_build_kqv(
     struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
     cb(kq, "kq", il);

-    if (max_alibi_bias > 0.0f) {
+    if (true) {
         // temporary branch until we figure out how to handle ggml_alibi through ggml_add
         kq = ggml_scale(ctx, kq, kq_scale);
         cb(kq, "kq_scaled", il);
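
(Changing the condition to true forces the pre-#4256 scale/mask/soft-max branch unconditionally, which bypasses the fused ggml_soft_max_ext path that #4256 introduced.)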

LostRuins commented 11 months ago

Nope, unfortunately this did not fix the issue; it still segfaults around the same point.

ggerganov commented 11 months ago

Hm, I don't see what could have affected the OpenCL backend in that change. Is there any extra information you can provide (e.g. a stack trace)? Does it depend on the value of -ngl?

LostRuins commented 11 months ago

There's no stack trace. In fact, there's no printout whatsoever; the program simply halts. I tried it again with 0 layers offloaded and it happens there too, crashing at the same place. CUDA is fine, however.

Here's a video of b1579 vs b1601, showing the differences. The video has been sped up by 2x, but you can rewind/pause it at any point to review.

https://github.com/ggerganov/llama.cpp/assets/39025047/08233bcb-f538-4df9-a66a-2dcbc42ee0b5

The test text file I used for input is the first 5 sections of the GPL license, which you can find here: test.txt

I can reproduce this consistently; it crashes at the same place every time. Reducing the prompt to a shorter one allows it to work. I am on Windows 10 with an RTX 2060.

AlpinDale commented 11 months ago

I can confirm this happens for me too, with the same command and prompt as @LostRuins. Hardware is an RTX 2070S and an Intel i7-8700, and I'm on Linux 6.5.9. It happens with both -ngl 0 and -ngl 99. The error I get is:

free(): invalid next size (normal)
zsh: IOT instruction (core dumped)

With -ngl 32 (7B GGUF model), I get a different error followed by a segfault:

ggml_opencl: clSetKernelArg(*to_fp32_cl, 0, sizeof(cl_mem), &d_Q) error -38 at ggml-opencl.cpp:1733
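
(For reference, OpenCL error -38 is CL_INVALID_MEM_OBJECT, i.e. the kernel was handed an invalid buffer object.)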

ggerganov commented 11 months ago

And b1600 works?

AlpinDale commented 11 months ago

I tested further, and I get a core dump with lower -c values too (tried 2048 and 1600). It's an IOT instruction core dump.

LostRuins commented 11 months ago

Reverting this specific commit, "ggml : add ggml_soft_max_ext" (#4256), seems to work.
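
(For anyone reproducing, that can be done with e.g. git revert ef47ec18da469423c276b683dd9b5741cee7023e, the commit hash AlpinDale cites below.)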

slaren commented 11 months ago

The free() error suggests that this is a memory corruption issue; the changes in #4256 are not likely to be related. Running this with an ASAN build (enable LLAMA_SANITIZE_ADDRESS and LLAMA_SANITIZE_UNDEFINED) may show the source of the issue.
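
For a CMake build, that would look something like this (a sketch; assumes a fresh Linux build directory and the default generator):

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DLLAMA_SANITIZE_ADDRESS=ON -DLLAMA_SANITIZE_UNDEFINED=ON
cmake --build .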

ggerganov commented 11 months ago

I'm able to reproduce - looking into it

AlpinDale commented 11 months ago

I built with ASan; here's the error trace I get when running with this command:

./main -m ~/models/openhermes-2-mistral-7b.Q6_K.gguf -c 4096 -b 512 -n 32 -ngl 99 -f test.txt

Error:

Log start
main: build = 1604 (33e171d)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1701609933
ggml_opencl: clGetPlatformIDs(NPLAT, platform_ids, &n_platforms) error -1001 at ggml-opencl.cpp:965

=================================================================
==2203735==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 264 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260a5dee0  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205dee0)
    #2 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #3 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260a5b80a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205b80a)
    #2 0x7f1260a5cfaa  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205cfaa)
    #3 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #4 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260a5b8ae  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205b8ae)
    #2 0x7f1260a5cfaa  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205cfaa)
    #3 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #4 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260a5b936  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205b936)
    #2 0x7f1260a5cfaa  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205cfaa)
    #3 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #4 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e1359 in __interceptor_malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7f1260b70de2  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x2170de2)
    #2 0x7f1260a576bf  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x20576bf)
    #3 0x7f1260a5d18f  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205d18f)
    #4 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #5 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e1359 in __interceptor_malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7f1260b70de2  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x2170de2)
    #2 0x7f1260a576db  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x20576db)
    #3 0x7f1260a5d18f  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205d18f)
    #4 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #5 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Indirect leak of 320 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260b70df9  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x2170df9)
    #2 0x7f1260a576db  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x20576db)
    #3 0x7f1260a5d18f  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205d18f)
    #4 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #5 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Indirect leak of 320 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260b70df9  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x2170df9)
    #2 0x7f1260a576bf  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x20576bf)
    #3 0x7f1260a5d18f  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205d18f)
    #4 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #5 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

SUMMARY: AddressSanitizer: 1136 byte(s) leaked in 8 allocation(s).

Reverted ef47ec18da469423c276b683dd9b5741cee7023e (#4256) and retrying now.

ggerganov commented 11 months ago

@AlpinDale When running with ASAN, you need to add this env variable to get past these bogus errors on init: ASAN_OPTIONS=protect_shadow_gap=0 ./main ..

Doing that, I now get the following sanitizer errors, confirming a bug in ggml.c that I introduced in #4256:

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1,100, frequency_penalty = 0,000, presence_penalty = 0,000
    top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
    mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
generate: n_ctx = 4096, n_batch = 512, n_predict = 32, n_keep = 0

 GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

Copyright © 2007 Free Software Foundation, Inc. <https://fsf.org/>

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble
The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it.

For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.

Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users.

Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free.

The precise terms and conditions for copying, distribution and modification follow.

TERMS AND CONDITIONS
0. Definitions.
“This License” refers to version 3 of the GNU General Public License.

“Copyright” also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.

“The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.

To “modify” a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a “modified version” of the earlier work or a work “based on” the earlier work.

A “covered work” means either the unmodified Program or a work based on the Program.

To “propagate” a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes
=================================================================
==364805==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62d000fc6580 at pc 0x5620bf18802b bp 0x7fe2cf3f2840 sp 0x7fe2cf3f2830
WRITE of size 4 at 0x62d000fc6580 thread T28
    #0 0x5620bf18802a in ggml_vec_cpy_f32 /home/ggerganov/development/github/llama.cpp/ggml.c:1158
    #1 0x5620bf22385d in ggml_compute_forward_soft_max_f32 /home/ggerganov/development/github/llama.cpp/ggml.c:10614
    #2 0x5620bf2244aa in ggml_compute_forward_soft_max /home/ggerganov/development/github/llama.cpp/ggml.c:10668
    #3 0x5620bf25fbbe in ggml_compute_forward /home/ggerganov/development/github/llama.cpp/ggml.c:13905
    #4 0x5620bf27e361 in ggml_graph_compute_thread /home/ggerganov/development/github/llama.cpp/ggml.c:15860
    #5 0x7fe42b494ac2 in start_thread nptl/pthread_create.c:442
    #6 0x7fe42b526a3f  (/lib/x86_64-linux-gnu/libc.so.6+0x126a3f)

0x62d000fc6580 is located 0 bytes to the right of 33152-byte region [0x62d000fbe400,0x62d000fc6580)
allocated by thread T0 here:
    #0 0x7fe42ccb61e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x5620bf14270a in __gnu_cxx::new_allocator<unsigned char>::allocate(unsigned long, void const*) /usr/include/c++/11/ext/new_allocator.h:127
    #2 0x5620bf11ee72 in std::allocator_traits<std::allocator<unsigned char> >::allocate(std::allocator<unsigned char>&, unsigned long) /usr/include/c++/11/bits/alloc_traits.h:464
    #3 0x5620bf0ea3eb in std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_allocate(unsigned long) /usr/include/c++/11/bits/stl_vector.h:346
    #4 0x5620bf0a3ffb in std::vector<unsigned char, std::allocator<unsigned char> >::_M_default_append(unsigned long) /usr/include/c++/11/bits/vector.tcc:635
    #5 0x5620bf06d1ab in std::vector<unsigned char, std::allocator<unsigned char> >::resize(unsigned long) /usr/include/c++/11/bits/stl_vector.h:940
    #6 0x5620bef398d0 in ggml_graph_compute_helper /home/ggerganov/development/github/llama.cpp/llama.cpp:668
    #7 0x5620bef8f6b2 in llama_decode_internal /home/ggerganov/development/github/llama.cpp/llama.cpp:5577
    #8 0x5620befc9a09 in llama_decode /home/ggerganov/development/github/llama.cpp/llama.cpp:9462
    #9 0x5620bedd4eb5 in llama_init_from_gpt_params(gpt_params&) /home/ggerganov/development/github/llama.cpp/common/common.cpp:996
    #10 0x5620bed77fc5 in main /home/ggerganov/development/github/llama.cpp/examples/main/main.cpp:187
    #11 0x7fe42b429d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Thread T28 created by T0 here:
    #0 0x7fe42cc58685 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x5620bf282b56 in ggml_graph_compute /home/ggerganov/development/github/llama.cpp/ggml.c:16094
    #2 0x5620bef3994f in ggml_graph_compute_helper /home/ggerganov/development/github/llama.cpp/llama.cpp:672
    #3 0x5620bef8f6b2 in llama_decode_internal /home/ggerganov/development/github/llama.cpp/llama.cpp:5577
    #4 0x5620befc9a09 in llama_decode /home/ggerganov/development/github/llama.cpp/llama.cpp:9462
    #5 0x5620bed8b2fa in main /home/ggerganov/development/github/llama.cpp/examples/main/main.cpp:605
    #6 0x7fe42b429d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ggerganov/development/github/llama.cpp/ggml.c:1158 in ggml_vec_cpy_f32
Shadow bytes around the buggy address:
  0x0c5a801f0c60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5a801f0c70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5a801f0c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5a801f0c90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5a801f0ca0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c5a801f0cb0:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0cc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0cd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0ce0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0cf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0d00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==364805==ABORTING
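
Reading the trace: a thread in the new fused soft-max (ggml_compute_forward_soft_max_f32) copied a full row into the graph work buffer and the first bad write landed 0 bytes past its end, so the buffer-size estimate in the plan and the per-thread offset in the kernel most likely disagree. Below is a minimal, self-contained sketch of that class of bug, with made-up names and sizes rather than the actual ggml code:

// Sketch only: a shared scratch buffer is sized with one formula, but each
// worker offsets into it with a slightly different one (here: an extra
// per-thread pad), so a late thread's row copy starts inside the buffer and
// runs past its end. The first out-of-bounds write lands exactly 0 bytes to
// the right of the region, matching the ASAN report above.
#include <stdlib.h>

enum { NE00 = 1024, N_THREADS = 8, PAD_F32 = 16 };

static void vec_cpy_f32(const int n, float * y, const float * x) {
    for (int i = 0; i < n; ++i) y[i] = x[i]; // each iteration: WRITE of size 4
}

int main(void) {
    // Size estimate: NE00 floats per thread, no pad ...
    float * wdata = malloc(sizeof(float) * NE00 * N_THREADS);
    float   row[NE00] = {0};

    for (int ith = 0; ith < N_THREADS; ++ith) {
        // ... but each worker is offset by NE00 + PAD_F32 elements, so for
        // the last threads the copy crosses the end of the allocation.
        float * wp = wdata + (size_t)(NE00 + PAD_F32) * ith;
        vec_cpy_f32(NE00, wp, row); // heap-buffer-overflow
    }

    free(wdata);
    return 0;
}

The fix for this class of bug is to make the planner and the kernel share one expression for the per-thread stride (here, sizing the buffer as (NE00 + PAD_F32) * N_THREADS), which is presumably the shape of the change proposed below.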

ggerganov commented 11 months ago

Please confirm that #4307 works

LostRuins commented 11 months ago

Sorry I couldn't help more with the debugging. Anyway, https://github.com/ggerganov/llama.cpp/pull/4307 seems to work for me; the segfault no longer occurs.

ggerganov commented 11 months ago

No problem - thank you very much for reporting this issue