Mizstik closed this issue 9 months ago.
Sorry, can I check which model you tried this with?
A bunch of models I happened to have on the phone at the time. If I remember correctly:
mistral-7b-instruct-v0.2.Q3_K_L.gguf guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin orca_mini_v3_7b.ggmlv3.q3_K_M.bin
They all worked after I reverted to v1.52.
I have the very same problem. I am running it on Android (aarch64) in Termux, and the last version that works on my phone is 1.53. I have tried a bunch of models, downloading the .zip manually, and building it with make, but I always get the same log when I run it:
Welcome to KoboldCpp - Version 1.54
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
Namespace(model=None, model_param='model1-7b.Q5_K_M.gguf', port=5001, port_param=5001, host='', launch=False, lora=None, config=None, threads=3, blasthreads=3, highpriority=False, contextsize=2048, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=None, gpulayers=0, tensor_split=None, onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)
Loading model: /data/data/com.termux/files/home/koboldcpp-1.54/model1-7b.Q5_K_M.gguf [Threads: 3, BlasThreads: 3, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: llama
Identified as LLAMA model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
Illegal instruction
I wonder if it's a problem with a Termux update. Can you still build the older version?
I wasn't updating Termux (or any package) at all when I discovered this issue. It's typical to freeze Termux updates because we don't want the F-Droid version to be overwritten by the Play Store version. I haven't run pkg update in forever either and only git pulled koboldcpp. (Though after I discovered the issue, I went around to update everything to see if any updated packages would fix the issue but no dice.)
So after all that, I tried downgrading to v1.52 and that works. The other guy above said that v1.53 also works. v1.54 and above don't work.
Also note that they all built fine. They fail at runtime, during model load.
I wonder if it's a problem with a Termux update. Can you still build the older version?
Yes, 1.53 still builds and runs just fine.
I will take a look
I will take a look
Tell me if you need me to test anything.
Hi everyone, I just did a clean install on the latest experimental for guanaco and phi; both seem to be working. How did you set up the install? Can you try these steps:
termux-change-repo and choose Mirror by BFSU
pkg install wget git python (plus any other missing packages)
apt install openssl (if needed)
git clone -b concedo_experimental https://github.com/LostRuins/koboldcpp.git
cd koboldcpp
make
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q2_K.gguf
python koboldcpp.py --model phi-2.Q2_K.gguf
Open http://localhost:5001 on your mobile browser.
Tell me:
I'm wondering if someone somewhere pushed a bad package for something. It would be good to describe how you originally set things up vs. the above steps, if there are any differences.
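For reference, here are the same steps collected into a single run, as a rough sketch (the mirror choice in termux-change-repo is interactive, so it stays a manual first step; the branch and model URL are the ones listed above):

# Sketch of the clean-install test above as one sequence.
# Run termux-change-repo first and pick Mirror by BFSU manually.
pkg install wget git python        # plus any other missing packages
apt install openssl                # if needed
git clone -b concedo_experimental https://github.com/LostRuins/koboldcpp.git
cd koboldcpp
make
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q2_K.gguf
python koboldcpp.py --model phi-2.Q2_K.gguf
# then open http://localhost:5001 in the phone's browser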
Unfortunately I am getting the same error.
For my original setups I tried cloning the repo and manually downloading the zip from Releases, but both failed to load the model.
This is the log after following your last post:
~/koboldcpp $ python koboldcpp.py --model phi-2.Q2_K.gguf
Welcome to KoboldCpp - Version 1.56
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
Namespace(model='phi-2.Q2_K.gguf', model_param='phi-2.Q2_K.gguf', port=5001, port_param=5001, host='', launch=False, lora=None, config=None, threads=3, blasthreads=3, highpriority=False, contextsize=2048, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=None, gpulayers=0, tensor_split=None, onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)
Loading model: /data/data/com.termux/files/home/koboldcpp/phi-2.Q2_K.gguf [Threads: 3, BlasThreads: 3, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: phi2
Identified as LLAMA model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
Illegal instruction
~/koboldcpp $
Still no go. Here's my Termux log from start to finish. I uninstalled Termux and then reinstalled it fresh from F-Droid. The log starts from changing the repo and includes the compile output, the wget, and the koboldcpp run, up until the last line, which is Illegal instruction.
@LostRuins Which phone are you using? If I have the same or a similar phone, I can go try.
I was testing on a Samsung Galaxy S9 Plus
Hmm I don't have that one but I do have a phone also with SD845 (Poco F1). I'll give it a try.
Mine is a ROG Phone 6 with aarch64:
Android 12
Qualcomm® Snapdragon® 8+ Gen 1 Mobile Platform
Qualcomm® Adreno™ 730
LPDDR5 16GB
Also ROG Phone 6 here but I have Android 13 & December security update.
Also ROG Phone 6 here but I have Android 13 & December security update.
I do not want to derail the thread, but are you in the ROG beta program? I have the latest security update, but mine didn't get Android 13 even when manually checking for it.
Also ROG Phone 6 here but I have Android 13 & December security update.
I do not want to derail the thread, but are you in the ROG beta program? I have the latest security update, but mine didn't get Android 13 even when manually checking for it.
I don't think I'm in the beta program. I don't remember applying at least. Perhaps it's a staggered deployment.
I also got Illegal instruction on the Poco F1 but curiously it got further into the loading process before failing.
~/koboldcpp $ python koboldcpp.py --model phi-2.Q2_K.gguf
***
Welcome to KoboldCpp - Version 1.56
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model='phi-2.Q2_K.gguf', model_param='phi-2.Q2_K.gguf', port=5001, port_param=5001, host='', launch=False, lora=None, config=None, threads=3, blasthreads=3, highpriority=False, contextsize=2048, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=None, gpulayers=0, tensor_split=None, onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)
==========
Loading model: /data/data/com.termux/files/home/koboldcpp/phi-2.Q2_K.gguf
[Threads: 3, BlasThreads: 3, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: phi2
---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /data/data/com.termux/files/home/koboldcpp/phi-2.Q2_K.gguf (version GGUF V3 (latest))
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 80
llm_load_print_meta: n_embd_head_v = 80
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2560
llm_load_print_meta: n_embd_v_gqa = 2560
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 2.78 B
llm_load_print_meta: model size = 1.09 GiB (3.37 BPW)
llm_load_print_meta: general.name = Phi2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 1117.52 MiB
..........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 2128
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 665.00 MiB
llama_new_context_with_model: KV self size = 665.00 MiB, K (f16): 332.50 MiB, V (f16): 332.50 MiB
llama_new_context_with_model: graph splits (measure): 1
llama_new_context_with_model: CPU compute buffer size = 177.16 MiB
Illegal instruction
~/koboldcpp $
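(Side note: if it helps to pinpoint where the SIGILL happens, one option is to re-run the failing load under gdb. This is only a sketch; it assumes gdb installs cleanly from the Termux repos and that the crash still reproduces under the debugger.)

pkg install gdb
gdb -q --args python koboldcpp.py --model phi-2.Q2_K.gguf
# inside gdb: type "run"; when SIGILL is raised,
#   bt          shows the native backtrace into koboldcpp_default.so
#   x/i $pc     shows the exact faulting instruction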
Once again I tried reverting to v1.53 and now it works:
[-Wunused-command-line-argument]
Your OS does not appear to be Windows. For faster speeds, install and link a BLAS library. Set LLAMA_OPENBLAS=1 to compile with OpenBLAS support or LLAMA_CLBLAST=1 to compile with ClBlast support. This is just a reminder, not an error.
~/koboldcpp $ python koboldcpp.py --model phi-2.Q2_K.gguf
***
Welcome to KoboldCpp - Version 1.53
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model='phi-2.Q2_K.gguf', model_param='phi-2.Q2_K.gguf', port=5001, port_param=5001, host='', launch=False, lora=None, config=None, threads=3, blasthreads=3, highpriority=False, contextsize=2048, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=None, gpulayers=0, tensor_split=None, onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)
==========
Loading model: /data/data/com.termux/files/home/koboldcpp/phi-2.Q2_K.gguf [Threads: 3, BlasThreads: 3, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: phi2
---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /data/data/com.termux/files/home/koboldcpp/phi-2.Q2_K.gguf (version GGUF V3 (latest))
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 2.78 B
llm_load_print_meta: model size = 1.09 GiB (3.37 BPW)
llm_load_print_meta: general.name = Phi2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: system memory used = 1117.64 MiB
..........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 2128
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 665.00 MiB, K (f16): 332.50 MiB, V (f16): 332.50 MiB
llama_build_graph: non-view tensors processed: 774/774
llama_new_context_with_model: compute buffer total size = 170.35 MiB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001
Okay, I noticed one interesting change: the value detected for FP16_VA was previously "0" when it worked, and it seems to be "1" when it fails for you. @Dravoss, do you observe the same situation as well (the older working version shows FP16_VA=0 and the newer crashing version shows FP16_VA=1)?
I want to know at which version it broke, and whether the FP16_VA change happened at the same version too.
Edit: I switched to a newer device and have managed to repro the same issue.
I want to know at which version it broke, and whether the FP16_VA change happened at the same version too.
Edit: I switched to a newer device and have managed to repro the same issue.
FP16_VA = 1 for both the 1.54 and experimental versions, and FP16_VA = 0 in the working 1.53 version.
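(Aside, in case it helps interpret that flag: if FP16_VA mirrors ggml's compile-time ARM FP16 macro, as I believe it does, then the flip from 0 to 1 between versions points at a build-flag change rather than at the hardware. A quick, hedged way to check what the CPU itself advertises in Termux:)

# list only the relevant aarch64 feature flags the kernel reports;
# fphp/asimdhp indicate FP16 support, sve indicates the Scalable Vector Extension
# (assumes the standard Linux /proc/cpuinfo layout; no output means the flag is absent)
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E 'fphp|asimdhp|sve'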
So 1.53 works, and 1.54 doesn't?
So 1.53 works, and 1.54 doesn't?
Yes
Thanks. I am slowly trying to find the offending commit.
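(In case it's useful, a bisect sketch between the last working and first broken tags; the tag names follow the v1.5x releases mentioned above, and the repro step is the same phi-2 load.)

git bisect start
git bisect bad v1.54        # first version reported to crash
git bisect good v1.53       # last version reported to work
# at each step git checks out a candidate commit; rebuild and re-test:
make clean && make
python koboldcpp.py --model phi-2.Q2_K.gguf    # loads OK -> git bisect good, SIGILL -> git bisect bad
git bisect reset            # when finished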
Okay I think I found the issue. Can you try to pull the latest commit from my concedo_experimental branch? It is working on my new device now.
Okay I think I found the issue. Can you try to pull the latest commit from my concedo_experimental branch? It is working on my new device now.
It seems to work now. Thank you for your hard work. I will try other models just to make sure, and if you ever need help testing koboldcpp on Android, let me know.
Confirmed working from here as well. Thank you!
A bit of a side note here, but after digging into the rabbit hole over at https://github.com/ggerganov/llama.cpp/issues/402 and some experimentation, I found that
CFLAGS += -march=armv9-a+nosve
CXXFLAGS += -march=armv9-a+nosve
works for SD8 Gen 1 with FP16_VA = 1. Performance compared to generic is about +28% on phi-2. (~6.39 T/s vs. ~5 T/s)
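In case anyone else wants to try it, here is a sketch of one way to apply the change (it simply appends the two lines to the end of the Makefile and rebuilds; it assumes a late += is still picked up by the compile rules, so double-check against your checkout, or edit the existing CFLAGS/CXXFLAGS lines directly):

cd ~/koboldcpp
# append the armv9-without-SVE flags and rebuild from scratch
printf '\nCFLAGS += -march=armv9-a+nosve\nCXXFLAGS += -march=armv9-a+nosve\n' >> Makefile
make clean && make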
In v1.54 and in the latest version, although it compiles successfully (no errors), koboldcpp fails at the step where it's about to load the model (after showing System Info) with "Illegal instruction" and then exits to the shell.
I did git checkout v1.52, make clean, then make, and with this it still runs fine. I didn't try 1.53, sorry.
Something happened after 1.52 that broke Android (aarch64) compatibility, at least in Termux, and possibly on other ARM-based SBCs. My particular phone is a ROG Phone 6 (SD8 Gen 1, 16 GB RAM).