# Expected Behavior

Run `./main` and continue without crashing.
# Current Behavior

The issue appears to be specific to q2_K: the same prompt with the q4_1 version of the same model works as expected, and an OpenBLAS build also works as expected. With the CLBlast build, the q2_K model aborts with:
```
GGML_ASSERT: /data/data/com.termux/files/home/cllama/ggml-opencl.cpp:1018: to_fp32_cl != nullptr
fish: Job 1, 'LD_LIBRARY_PATH=/vendor/lib64 .…' terminated by signal SIGABRT (Abort)
```
The crash is inconsistent: sometimes it happens before I can interact at all, and sometimes only after one or more messages.
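For context (as far as I can tell from the source), the assert fires where the OpenCL backend selects a dequantization ("to fp32") kernel for the weight tensor's quantization type. A minimal sketch of that pattern, with hypothetical names that only approximate the actual code in ggml-opencl.cpp:

```cpp
#include <CL/cl.h>  // cl_kernel
#include "ggml.h"   // ggml_type, GGML_TYPE_*

// Hypothetical kernel handles; the real backend creates these at init.
static cl_kernel dequant_q4_0_cl, dequant_q4_1_cl, dequant_q5_0_cl,
                 dequant_q5_1_cl, dequant_q8_0_cl;

// Sketch only: one dequantization kernel per quantization format,
// looked up by tensor type before a matmul. The k-quant formats
// (Q2_K .. Q6_K) have no entry, so the lookup yields nullptr.
static cl_kernel to_fp32_kernel(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0: return dequant_q4_0_cl;
        case GGML_TYPE_Q4_1: return dequant_q4_1_cl;
        case GGML_TYPE_Q5_0: return dequant_q5_0_cl;
        case GGML_TYPE_Q5_1: return dequant_q5_1_cl;
        case GGML_TYPE_Q8_0: return dequant_q8_0_cl;
        default:             return nullptr; // q2_K ends up here
    }
}

// The caller then asserts (ggml-opencl.cpp:1018 in this build):
//   cl_kernel to_fp32_cl = to_fp32_kernel(type);
//   GGML_ASSERT(to_fp32_cl != nullptr); // -> the SIGABRT seen above
```

So any path that tries to dequantize a q2_K tensor on the GPU trips the assert; both runs below end the same way.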
Run 1 (`-ngl 1`):

```
LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -i -ins -b 10 -ngl 1
main: build = 0 (unknown)
main: seed = 1686500268
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required = 4383.18 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 1 layers to GPU
llama_model_load_internal: total VRAM used: 81 MB
.....
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction: '
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 10, n_predict = -1, n_keep = 2

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

GGML_ASSERT: /data/data/com.termux/files/home/cllama/ggml-opencl.cpp:1018: to_fp32_cl != nullptr
fish: Job 1, 'LD_LIBRARY_PATH=/vendor/lib64 .…' terminated by signal SIGABRT (Abort)
```
Run 2 (no `-ngl`):

```
LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -i -ins
main: build = 0 (unknown)
main: seed = 1686500412
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required = 4464.12 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 0 MB
.
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction: '
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

> test test this is a test message *plenty of characters and numbers 1028748*
GGML_ASSERT: /data/data/com.termux/files/home/cllama/ggml-opencl.cpp:1018: to_fp32_cl != nullptr
fish: Job 1, 'LD_LIBRARY_PATH=/vendor/lib64 .…' terminated by signal SIGABRT (Abort)
```
# Environment and Context

`lscpu`:
```
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: Qualcomm
Model name: Kryo-4XX-Silver
Model: 14
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 0xd
CPU(s) scaling MHz: 62%
CPU max MHz: 1785.6000
CPU min MHz: 300.0000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Model name: Kryo-4XX-Gold
Model: 14
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 2
Stepping: 0xd
CPU(s) scaling MHz: 71%
CPU max MHz: 2841.6001
CPU min MHz: 710.4000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Vulnerable
  Spec store bypass: Vulnerable
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Mitigation; Branch predictor hardening
  Srbds: Not affected
  Tsx async abort: Not affected
```
`uname -a`:

```
Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android
```
`python3 --version`:

```
The program python3 is not installed.
```
`make --version`:

```
GNU Make 4.4.1
Built for aarch64-unknown-linux-android
```
`g++ --version`:

```
clang version 16.0.5
Target: aarch64-unknown-linux-android24
Thread model: posix
InstalledDir: /data/data/com.termux/files/usr/bin
```
# Steps to Reproduce

1. Build llama.cpp with CLBlast.
2. Run `./main` with a q2_K model.

Thanks.
From my understanding, k-quants in general are just not functioning with CLBlast GPU offloading right now, unfortunately.
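If I'm reading the CLBlast path right, that also explains why Run 2 crashed without `-ngl`: a CLBlast build still routes large prompt-processing matmuls through OpenCL even with zero layers offloaded, so a q2_K weight tensor reaches the missing dequantization kernel either way. Until support lands, the workaround is to use one of the original formats (q4_0, q4_1, q5_0, q5_1, q8_0) with CLBlast builds. Purely as a sketch (none of this is code from the repo), a guard like the following is the kind of check that would let the backend refuse k-quant offload at load time instead of aborting mid-run:

```cpp
#include <stdio.h>
#include "ggml.h"  // ggml_type enum, ggml_type_name

// Sketch: the k-quant type ids sit in a contiguous range of the
// ggml_type enum (GGML_TYPE_Q2_K .. GGML_TYPE_Q6_K in this era),
// so they are easy to detect before any OpenCL work is queued.
static bool is_k_quant(enum ggml_type t) {
    return t >= GGML_TYPE_Q2_K && t <= GGML_TYPE_Q6_K;
}

// Hypothetical use at model-load time, before offloading layers:
static bool cl_offload_supported(enum ggml_type weight_type) {
    if (is_k_quant(weight_type)) {
        fprintf(stderr, "warning: %s is not supported by the OpenCL "
                        "backend yet; keeping this tensor on the CPU\n",
                ggml_type_name(weight_type));
        return false;
    }
    return true;
}
```

A check like that would turn the hard SIGABRT into a clean CPU fallback until proper k-quant kernels are in place.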
OpenCL (and CLBlast) support for k-quants is being added with PR #1836