Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Illegal Instruction when running a llamafile #413

Closed cdamiens closed 4 months ago

cdamiens commented 4 months ago

Hi,

Issue:

I tried to run llava-v1.5-7b-q4.llamafile or TinyLlama-1.1B-Chat-v1.0.F16.llamafile on my system: Linux Ubuntu 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

But I encountered the same error at the same step for both:

stdout:

$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2856,"msg":"build info","tid":"11165056","timestamp":1715465433}
{"function":"server_cli","level":"INFO","line":2859,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11165056","timestamp":1715465433,"total_threads":4}
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = llama
llama_model_loader: - kv   1: llama.block_count u32 = 22
llama_model_loader: - kv   2: llama.context_length u32 = 2048
llama_model_loader: - kv   3: llama.embedding_length u32 = 2048
llama_model_loader: - kv   4: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv   5: llama.attention.head_count u32 = 32
llama_model_loader: - kv   6: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv   7: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv   8: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv   9: general.file_type u32 = 1
llama_model_loader: - kv  10: llama.vocab_size u32 = 32000
llama_model_loader: - kv  11: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv  12: tokenizer.ggml.model str = llama
llama_model_loader: - kv  13: tokenizer.ggml.pre str = default
llama_model_loader: - kv  14: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv  15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv  18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv  19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv  20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv  21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  22: general.quantization_version u32 = 2
llama_model_loader: - type  f32: 45 tensors
llama_model_loader: - type  f16: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 2 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MiB
llm_load_tensors: CPU buffer size = 2098.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CPU output buffer size = 0.13 MiB
llama_new_context_with_model: CPU compute buffer size = 66.50 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
Instruction non permise (core dumped)

llama.log content:

$ cat llama.log
warming up the model with an empty run
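The log stops at the warm-up run, so the SIGILL happens on the very first inference pass. On the kernel side the faulting instruction pointer should also show up in dmesg (a quick check, nothing llamafile-specific):

$ sudo dmesg | grep -i 'invalid opcode'

which prints a "traps: ... trap invalid opcode ip:..." line for each crash and can help confirm which thread and address died.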

lscpu

It seems to be CPU related, so here is my lscpu:

$ lscpu
Architecture:             x86_64
CPU op-mode(s):           32-bit, 64-bit
Address sizes:            36 bits physical, 48 bits virtual
Byte Order:               Little Endian
CPU(s):                   4
On-line CPU(s) list:      0-3
Vendor ID:                GenuineIntel
Model name:               Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU family:               6
Model:                    42
Thread(s) per core:       1
Core(s) per socket:       4
Socket(s):                1
Stepping:                 7
CPU max MHz:              3700.0000
CPU min MHz:              1600.0000
BogoMIPS:                 6619.18
Flags:                    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     6 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
  Mds:                    Mitigation; Clear CPU buffers; SMT disabled
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
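The flags line shows avx and sse4_2 but no avx2, f16c, or fma, which matches the AVX2 = 0 | FMA = 0 | F16C = 0 entries in the server's system_info output above. For anyone checking their own machine before picking weights, a quick sketch (the flag names are the standard /proc/cpuinfo ones):

$ grep -o -w -E 'avx2|f16c|fma' /proc/cpuinfo | sort -u

On this i5-2500K the command prints nothing.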

I saw a similar issue with a similar CPU: Support broken on old Intel/Amd CPUs #25. But as it does not crash at the same step, I was wondering if it could be related.

cdamiens commented 4 months ago

Last stdout lines with --ftrace flag:

$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile --ftrace
FUN 7143 7143 127'676'693'461 -123'127'225'490'312 &ggml_get_n_tasks.part.0
FUN 7143 7222 127'676'694'743 688 &ggml_get_n_tasks.part.0
FUN 7143 7223 127'676'695'076 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7224 127'676'695'958 688 &ggml_compute_forward
FUN 7143 7143 127'676'697'768 -123'127'225'490'312 &ggml_compute_forward
FUN 7143 7222 127'676'698'572 688 &ggml_compute_forward
FUN 7143 7224 127'676'700'804 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7223 127'676'700'698 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'702'180 -123'127'225'489'912 &ggml_compute_forward_mul_mat
FUN 7143 7222 127'676'703'139 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7224 127'676'704'676 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7223 127'676'705'968 1'632 &ggml_syncthreads
FUN 7143 7143 127'676'707'146 -123'127'225'489'288 &llamafile_sgemm_amd_avx
FUN 7143 7222 127'676'708'192 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7224 127'676'709'142 1'632 &ggml_syncthreads
FUN 7143 7143 127'676'711'551 -123'127'225'489'368 &ggml_fp32_to_fp16_row_amd_avx
FUN 7143 7222 127'676'712'329 1'632 &ggml_fp32_to_fp16_row_amd_avx
FUN 7143 7223 127'676'718'666 1'696 &sched_yield
FUN 7143 7224 127'676'722'670 1'696 &sched_yield
FUN 7143 7143 127'676'722'489 -123'127'225'489'368 &ggml_syncthreads
FUN 7143 7222 127'676'723'178 1'632 &ggml_syncthreads
FUN 7143 7222 127'676'727'610 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'728'117 -123'127'225'489'288 &llamafile_sgemm_amd_avx
FUN 7143 7222 127'676'731'582 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7223 127'676'733'826 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'736'033 -123'127'225'489'112 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7222 127'676'736'916 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7223 127'676'737'875 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7224 127'676'739'365 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'739'291 -123'127'225'489'032 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7223 127'676'741'568 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7224 127'676'742'748 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7224 127'676'746'386 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
Instruction non permise (core dumped)
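The last frames before the crash are inside tinyBLAS. If it helps, the mangled names can be decoded with c++filt (standard binutils), e.g.:

$ echo '_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll' | c++filt

which decodes to an anonymous-namespace tinyBLAS<...>::gemm<2, 2, 1> instantiation whose template arguments include unsigned short operands, consistent with the f16 weights being multiplied.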

jart commented 4 months ago

OK, you have a Sandy Bridge CPU. It's five years past EOL but still supported by us. Could you run ./llava-v1.5-7b-q4.llamafile --version and tell me what it says? It'd help to know which version of llamafile your llamafiles were built with.

cdamiens commented 4 months ago

Hi, sure, it's an old rig 😉 Sufficient for daily tasks, but outdated for modern AI experimentation...

Here is the information:

$ ./llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.4

Note: I had to download APE / APE-jart and register them.
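For reference, the registration followed the Gotchas section of the llamafile README. The commands below are reproduced from memory, so double-check the URL and binfmt_misc magic strings against the README itself:

$ sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
$ sudo chmod +x /usr/bin/ape
$ sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
$ sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"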

DjagbleyEmmanuel commented 4 months ago

Same thing here 

newca12 commented 4 months ago

It seems to be a regression between version 0.7.0 and version 0.8.0. Reproduced with a Xeon E5-2407 (sandybridge). [Everything is fine with a Xeon® Silver 4108 (skylake).]

model                                       version            status
mistral-7b-instruct-v0.2.Q5_K_M.llamafile   llamafile v0.7.0   OK
mistral-7b-instruct-v0.2.Q4_0.llamafile     llamafile v0.8.0   Illegal instruction (core dumped)
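Since the two rows above use different weights as well as different runtimes, one way to pin the regression on the runtime itself is to run the same GGUF under both bare release binaries with -m. A sketch; the release asset names here are assumptions, so check the Releases page for the exact file names:

$ curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.7.0/llamafile-0.7.0
$ curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.0/llamafile-0.8.0
$ chmod +x llamafile-0.7.0 llamafile-0.8.0
$ ./llamafile-0.7.0 -m mistral-7b-instruct-v0.2.Q4_0.gguf   # expected to run on the Sandy Bridge Xeon
$ ./llamafile-0.8.0 -m mistral-7b-instruct-v0.2.Q4_0.gguf   # expected to crash if the regression is in the runtime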
jart commented 4 months ago

I see what the issue is here and I've confirmed it. A fix is incoming.

jart commented 4 months ago

Please be warned that once this fix goes live, using f16 weights on a Sandy Bridge CPU that doesn't have the F16C ISA will no longer crash, but it will almost certainly be very slow. On an older CPU you'll most likely be better served by the q4 weights.
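A quick way to apply that advice on any given box (a sketch; the file names below are placeholders for whichever f16 and q4 llamafiles you actually have):

$ if grep -qw f16c /proc/cpuinfo; then ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile; else ./TinyLlama-1.1B-Chat-v1.0.Q4_0.llamafile; fi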