Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Segfault in WIN32 API upon OOM #558

Closed: safeswap closed this issue 2 weeks ago

safeswap commented 2 weeks ago

Contact Details

safeswapio@gmail.com

What happened?

The "llamafile" is not functioning properly, my system is windows11

Version

llamafile v0.8.13

What operating system are you seeing the problem on?

No response

Relevant log output

D:\download>llamafile-0.8.13.exe -m Phi-3.5-mini-instruct-Q8_0.gguf --gpu DISABLE
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2841,"msg":"build info","tid":"11681088","timestamp":1725033601}
{"function":"server_cli","level":"INFO","line":2844,"msg":"system info","n_threads":12,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11681088","timestamp":1725033601,"total_threads":24}
llama_model_loader: loaded meta data with 40 key-value pairs and 197 tensors from Phi-3.5-mini-instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Phi 3.5 Mini Instruct
llama_model_loader: - kv   3:                           general.finetune str              = instruct
llama_model_loader: - kv   4:                           general.basename str              = Phi-3.5
llama_model_loader: - kv   5:                         general.size_label str              = mini
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/microsoft/Phi-...
llama_model_loader: - kv   8:                               general.tags arr[str,3]       = ["nlp", "code", "text-generation"]
llama_model_loader: - kv   9:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv  10:                        phi3.context_length u32              = 131072
llama_model_loader: - kv  11:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  12:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv  13:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv  14:                           phi3.block_count u32              = 32
llama_model_loader: - kv  15:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv  16:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv  17:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  19:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  20:                          general.file_type u32              = 7
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 262144
llama_model_loader: - kv  22:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                      quantize.imatrix.file str              = /models_out/Phi-3.5-mini-instruct-GGU...
llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  38:             quantize.imatrix.entries_count i32              = 128
llama_model_loader: - kv  39:              quantize.imatrix.chunks_count i32              = 151
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q8_0:  130 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.1685 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 262144
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 3.78 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Phi 3.5 Mini Instruct
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =  3872.38 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size = 49152.00 MiB
llama_new_context_with_model: KV self size  = 49152.00 MiB, K (f16): 24576.00 MiB, V (f16): 24576.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =  8484.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1
{"function":"initialize","level":"INFO","line":491,"msg":"initializing slots","n_slots":1,"tid":"11681088","timestamp":1725033621}
{"function":"initialize","level":"INFO","line":500,"msg":"new slot","n_ctx_slot":131072,"slot_id":0,"tid":"11681088","timestamp":1725033621}
{"function":"server_cli","level":"INFO","line":3062,"msg":"model loaded","tid":"11681088","timestamp":1725033621}

error: Uncaught SIGSEGV (SEGV_1073807366) at 0x7ffb92292bdc on DESKTOP-KSKI7LK pid 15640 tid 19172
  llamafile-0.8.13.exe
  Function not implemented
  Windows Cosmopolitan 3.7.1 MODE=x86_64 DESKTOP-KSKI7LK 10.0

RAX 00007ffb9479b4b5 RBX 0000000000000001 RDI 000000007ffe0384
RCX 00007ffb00000000 RDX 0000000003423250 RSI 0000000000000000
RBP 00007000007db408 RSP 00007000007da460 RIP 00007ffb92292bdc
 R8 0000000000143740  R9 00007000007d9e30 R10 0000000000000000
R11 0000000000141d90 R12 00007ffb753d0000 R13 00007ffb753e7a38
R14 00007ffb753d0000 R15 0000000000000000
TLS 0000000000b0bdc0

XMM0  00000000000000000000000000000001 XMM8  00000000000000000000000000000000
XMM1  00007000007d9e98000000000014ce60 XMM9  00000000000000000000000000000000
XMM2  00007ffb947cf01500000000dfa00091 XMM10 00000000000000000000000000000000
XMM3  00007ffb914800e8000000007ffe0384 XMM11 00000000000000000000000000000000
XMM4  0000000000000000000000000014ce60 XMM12 00000000000000000000000000000000
XMM5  00007ffb948ac3f000007000007d9ee0 XMM13 00000000000000000000000000000000
XMM6  226c6576656c222c22696c635f726576 XMM14 00000000000000000000000000000000
XMM7  726573223a226e6f6974636e7566227b XMM15 00000000000000000000000000000000

cosmoaddr2line /D/download/llamafile-0.8.13.exe 7ffb92292bdc 100008c

000000b19d10 7ffb92292bdc NULL+0
7000007db408 100008c g_events+4639628

000000400000-000000afc1f8 r-x-- 7152kb
000000afd000-0000031bc000 rw--- 39mb
0000031c0000-0000031d0000 rw-Pa 64kb hand=296
0006fe000000-0006fe010000 rw-pa 64kb hand=300
0050473e0000-0050473e1000 ---pa 4096b hand=404
0050473e1000-005047400000 rw-pa 124kb
026a9a790000-026a9ad90000 rw-pa 6144kb hand=576
037471040000-037471040fc0 rw-pa 4032b hand=684
03f104910000-03f104911000 ---pa 4096b hand=588
03f104911000-03f104930000 rw-pa 124kb
0652182e0000-0652182e0fc0 rw-pa 4032b hand=676
074e8f7b0000-074e8f7b1000 ---pa 4096b hand=604
074e8f7b1000-074e8f7d0000 rw-pa 124kb
09c720800000-09c720801000 ---pa 4096b hand=556
09c720801000-09c720820000 rw-pa 124kb
# 64'899'010'560 bytes in 78 mappings
llamafile-0.8.13.exe -m Phi-3.5-mini-instruct-Q8_0.gguf --gpu DISABLE
Terminating on masked SIGSEGV. Pass --strace and/or ShowCrashReports() for details.
jart commented 2 weeks ago

Could you try passing the flag -c 2048 to limit the context size?
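For example, taking the exact command from the log above and appending the suggested flag (illustrative; same model and options as the original report):

```
llamafile-0.8.13.exe -m Phi-3.5-mini-instruct-Q8_0.gguf --gpu DISABLE -c 2048
```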

jart commented 2 weeks ago

Also the crash you reported appears to be happening inside the WIN32 API.

inforithmics commented 2 weeks ago

Could this be the problem? llama_kv_cache_init: CPU KV buffer size = 49152.00 MiB allocates around 50 GB of RAM for the KV cache alone. After I made sure I had at least 59 GB of memory free, it worked.

-c 2048 helped too; it then used only 768.00 MiB for the KV cache.
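For context, the KV cache grows linearly with the context length. A minimal back-of-the-envelope sketch (not llamafile's actual allocation code) using the parameters printed in the log above reproduces both the 49152.00 MiB and the 768.00 MiB figures:

```python
# Rough KV cache size estimate for this model (values from llm_load_print_meta).
# Illustrative arithmetic only, not llamafile's allocator.
n_layer = 32          # llm_load_print_meta: n_layer
n_embd_k_gqa = 3072   # llm_load_print_meta: n_embd_k_gqa
n_embd_v_gqa = 3072   # llm_load_print_meta: n_embd_v_gqa
bytes_per_elem = 2    # K and V are stored as f16

def kv_cache_mib(n_ctx: int) -> float:
    """KV cache size in MiB for a given context length."""
    total = n_ctx * n_layer * (n_embd_k_gqa + n_embd_v_gqa) * bytes_per_elem
    return total / (1024 * 1024)

print(kv_cache_mib(131072))  # 49152.0 MiB at the full 128k training context
print(kv_cache_mib(2048))    # 768.0 MiB with -c 2048
```

So at the default 131072-token context, the KV cache alone needs 48 GiB of RAM before the model weights and compute buffers are counted, which is why the allocation fails on most machines.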

jart commented 2 weeks ago

Glad limiting the context fixed it. Since the OOM crash happens inside WIN32 code, I don't think this is actionable on our end. Thanks for the report. Making these 128k-context models less surprising with respect to memory requirements is a known issue; we're working on that too.

safeswap commented 2 weeks ago

-c 2048 didn't work for me; I hope you can fix this problem soon.