phil-aid closed this issue 1 year ago
I am also getting this error.
This seems to be a duplicate of #1697. Are you trying to load anything other than a q4_0-quantized model? That is not yet supported with Metal (only q4_0 is supported right now).
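(In case it helps anyone landing here: re-quantizing to q4_0 with the bundled `quantize` tool looks roughly like this; the model paths below are just placeholders.)

```
# convert an f16 ggml model to q4_0 (example paths)
./quantize models/7B/ggml-model-f16.bin models/7B/ggml-model-q4_0.bin q4_0
```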
I am getting nonsense when using Metal. Without it the models perform normally. Is anyone else seeing this? Here is my output:
```
(base) adam@adams-mbp bin % ./main -m /Users/adam/Documents/Projects/langchainstufff/llama.cpp/models/guanaco-7B.ggmlv3.q4_0.bin -p "I believe the meaning of life is " --ignore-eos -ngl 1
main: build = 613 (827f5ed)
main: seed  = 1686008097
llama.cpp: loading model from /Users/adam/Documents/Projects/langchainstufff/llama.cpp/models/guanaco-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required  = 1932.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/adam/Documents/Projects/langchainstufff/llama.cpp/build-metal/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add             0x7fc13bf07080
ggml_metal_init: loaded kernel_mul             0x7fc13bf07860
ggml_metal_init: loaded kernel_mul_row         0x7fc13bf08040
ggml_metal_init: loaded kernel_scale           0x7fc13bf08820
ggml_metal_init: loaded kernel_silu            0x7fc13bf09000
ggml_metal_init: loaded kernel_relu            0x7fc13bf097e0
ggml_metal_init: loaded kernel_soft_max        0x7fc13bf09fc0
ggml_metal_init: loaded kernel_diag_mask_inf   0x7fc13bf0a7a0
ggml_metal_init: loaded kernel_get_rows_q4_0   0x7fc13bf0af80
ggml_metal_init: loaded kernel_rms_norm        0x7fc13bf0b760
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x7fc13bf0bf40
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x7fc13bf0c890
ggml_metal_init: loaded kernel_rope            0x7fc13bf0d070
ggml_metal_init: loaded kernel_cpy_f32_f16     0x7fc13bf0d850
ggml_metal_init: loaded kernel_cpy_f32_f32     0x7fc13bf0e030
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3616.07 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size =  768.00 MB
ggml_metal_add_buffer: allocated 'kv   ' buffer, size =  258.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size =  512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size =  512.00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 I believe the meaning of life is 4ceratypenameculgeuenani
```
I think my issue is that I am not on Apple Silicon? I have an Intel CPU MacBook Pro (2018).
Getting the same thing, and I am on Apple Silicon (M1 Max, 32 core). AFAIK I compiled everything correctly with LLAMA_METAL=1, and I think I quantized the llama models correctly too.
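(For reference, the Metal build at the time looked roughly like this; I'm assuming a CMake build since the log above points at `build-metal/bin`, and the directory name is just an example.)

```
# configure and build llama.cpp with the Metal backend enabled
mkdir build-metal && cd build-metal
cmake -DLLAMA_METAL=ON ..
cmake --build . --config Release
```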
Run `make clean` and retry with latest master. Make a new issue if it still fails.
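(Concretely, for a Makefile build that would be something like the following; the model path is just an example:)

```
git pull
make clean
LLAMA_METAL=1 make
./main -m models/guanaco-7B.ggmlv3.q4_0.bin -p "I believe the meaning of life is " --ignore-eos -ngl 1
```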
@ggerganov - I have tried on latest master (590250f) and AFAIK it still fails for me with the same error as here.
I have tried a commit from right after the initial Metal implementation was merged (d1f563a743a83dabc11e125d4a7d64189c16498c), and the same steps work correctly there, so I think something broke in the subsequent commits.
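(With a known-good and a known-bad commit, `git bisect` can narrow down the breaking change; a sketch, rebuilding and re-running the repro at each step:)

```
git bisect start
git bisect bad 590250f                                     # latest master: broken
git bisect good d1f563a743a83dabc11e125d4a7d64189c16498c   # initial Metal merge: works
# at each commit git bisect checks out, rebuild and test:
make clean && LLAMA_METAL=1 make
# run the repro, then mark the result:
git bisect good    # or: git bisect bad
```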
@rafalio I suspect you're seeing the same thing as me: a subsequent change to the quantization code broke something; see https://github.com/ggerganov/llama.cpp/issues/1711