
[User] CLBlast crash with q2_K model #1804

Closed: ghost closed this issue 1 year ago

ghost commented 1 year ago

Expected Behavior

./main runs and continues generating without crashing.

Current Behavior

The issue appears to be specific to q2_K. I tried the same prompt with the same model quantized as q4_1, and it works as expected. An OpenBLAS build also works as expected.

GGML_ASSERT: /data/data/com.termux/files/home/cllama/ggml-opencl.cpp:1018: to_fp32_cl != nullptr
fish: Job 1, 'LD_LIBRARY_PATH=/vendor/lib64 .…' terminated by signal SIGABRT (Abort)
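
If it helps with triage: the assert is the OpenCL backend's type dispatch failing. It looks up a dequantize-to-fp32 kernel for the weight's quantization type, and nothing is registered for the k-quant types, so the lookup comes back null. Below is a minimal standalone sketch of that failure mode; the dispatch and kernel names are simplified stand-ins, not the actual ggml-opencl.cpp code.

```cpp
#include <cassert>
#include <cstddef>

#define GGML_ASSERT(x) assert(x) // the real macro prints file:line, then aborts

enum ggml_type { GGML_TYPE_Q4_0, GGML_TYPE_Q4_1, GGML_TYPE_Q2_K };

using to_fp32_fn = void (*)(const void * src, float * dst, std::size_t n);

// Host-side stand-ins for the compiled OpenCL dequantization kernels.
static void dequant_q4_0(const void *, float *, std::size_t) {}
static void dequant_q4_1(const void *, float *, std::size_t) {}

// Per-type kernel lookup; k-quant types have no entry, so they fall through
// to nullptr, which is exactly the condition the GGML_ASSERT checks for.
static to_fp32_fn get_to_fp32_cl(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0: return dequant_q4_0;
        case GGML_TYPE_Q4_1: return dequant_q4_1;
        default:             return nullptr; // no k-quant kernels registered
    }
}

int main() {
    to_fp32_fn to_fp32_cl = get_to_fp32_cl(GGML_TYPE_Q2_K);
    GGML_ASSERT(to_fp32_cl != nullptr); // aborts with SIGABRT, as in the log
}
```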

The crash is inconsistent: sometimes it happens before I can interact at all, and sometimes only after one or more messages.
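
My guess at the inconsistency: with -ngl 1, the offloaded layer's weights go through the OpenCL path on the very first evaluation, while with no offloading a CLBlast build still routes sufficiently large mat-muls through OpenCL (note BLAS = 1 in the system_info lines below), so the crash waits for the first batch big enough to take that path. Here is a sketch of the size heuristic as I understand it; the 32-per-dimension threshold is an assumption from memory, and the real check in ggml.c also requires contiguous operands.

```cpp
#include <cstdio>

// Hypothetical shape holder; in ggml these sizes come from the tensors' ne[] arrays.
struct mm_shape { int ne0, ne1, ne10; };

// Sketch of ggml's "route this mat-mul through BLAS/OpenCL?" heuristic.
static bool mul_mat_use_blas(mm_shape s) {
    return s.ne0 >= 32 && s.ne1 >= 32 && s.ne10 >= 32;
}

int main() {
    // A single-token decode step fails the test and stays on the CPU path...
    std::printf("decode step:  %d\n", mul_mat_use_blas({4096, 1, 4096}));
    // ...a 512-token prompt batch passes it and hits the missing q2_K kernel.
    std::printf("prompt batch: %d\n", mul_mat_use_blas({4096, 512, 4096}));
}
```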

Run 1: (-ngl 1)

LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -i -ins -b 10 -ngl 1           
main: build = 0 (unknown)
main: seed  = 1686500268
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 4383.18 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 1 layers to GPU
llama_model_load_internal: total VRAM used: 81 MB
.....
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 10, n_predict = -1, n_keep = 2

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 GGML_ASSERT: /data/data/com.termux/files/home/cllama/ggml-opencl.cpp:1018: to_fp32_cl != nullptr
fish: Job 1, 'LD_LIBRARY_PATH=/vendor/lib64 .…' terminated by signal SIGABRT (Abort)
Run 2: (No -ngl)

LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -i -ins
main: build = 0 (unknown)
main: seed  = 1686500412
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 4464.12 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 0 MB
.
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

> test test this is a test message *plenty of characters and numbers 1028748*
GGML_ASSERT: /data/data/com.termux/files/home/cllama/ggml-opencl.cpp:1018: to_fp32_cl != nullptr
fish: Job 1, 'LD_LIBRARY_PATH=/vendor/lib64 .…' terminated by signal SIGABRT (Abort)

Environment and Context

lscpu

Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              Qualcomm
  Model name:           Kryo-4XX-Silver
    Model:              14
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           0xd
    CPU(s) scaling MHz: 62%
    CPU max MHz:        1785.6000
    CPU min MHz:        300.0000
    BogoMIPS:           38.40
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
  Model name:           Kryo-4XX-Gold
    Model:              14
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          2
    Stepping:           0xd
    CPU(s) scaling MHz: 71%
    CPU max MHz:        2841.6001
    CPU min MHz:        710.4000
    BogoMIPS:           38.40
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Vulnerabilities:
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Vulnerable
  Spec store bypass:    Vulnerable
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; Branch predictor hardening
  Srbds:                Not affected
  Tsx async abort:      Not affected

uname -a

Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android

make --version

GNU Make 4.4.1
Built for aarch64-unknown-linux-android

g++ --version

clang version 16.0.5
Target: aarch64-unknown-linux-android24
Thread model: posix
InstalledDir: /data/data/com.termux/files/usr/bin

Steps to Reproduce

1. Build llama.cpp with CLBlast enabled.
2. Run ./main with a q2_K model, as in the commands above.

Thanks.
DocShotgun commented 1 year ago

From my understanding, k-quants in general just don't work with CLBlast GPU offloading right now, unfortunately.

ikawrakow commented 1 year ago

OpenCL (and CLBlast) support for k-quants is being added in PR #1836.
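
Schematically, such support amounts to compiling dequantization kernels for the k-quant types and registering them in the per-type lookup, so it stops returning null for q2_K. A sketch of the shape of that change follows; all names here are hypothetical stand-ins, not code from PR #1836.

```cpp
#include <cstddef>

enum ggml_type { GGML_TYPE_Q2_K, GGML_TYPE_Q3_K, GGML_TYPE_Q4_K,
                 GGML_TYPE_Q5_K, GGML_TYPE_Q6_K };

using to_fp32_fn = void (*)(const void *, float *, std::size_t);

static void dequant_q2_K(const void *, float *, std::size_t) {} // stand-in

// Once a compiled OpenCL kernel exists per k-quant type, the lookup from the
// crash site returns a real kernel instead of nullptr.
static to_fp32_fn get_to_fp32_cl(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q2_K: return dequant_q2_K;
        // q3_K .. q6_K would be registered the same way
        default:             return nullptr;
    }
}

int main() { return get_to_fp32_cl(GGML_TYPE_Q2_K) != nullptr ? 0 : 1; }
```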