ghost opened this issue 1 year ago
I tried the following.
Build: LLAMA_METAL=1 make falcon_main falcon_quantize falcon_perplexity
Then I ran the model with: ./falcon_main -t 4 -ngl 100 -b 1 -m ../Models/WizardLM-Uncensored-Falcon-7B-GGML/wizardlm-7b-uncensored.ggccv1.q4_0.bin -enc -p "write a story about llamas"
It outputs:
main: build = 883 (2b487f2)
falcon.cpp: loading model from ../Models/WizardLM-Uncensored-Falcon-7B-GGML/wizardlm-7b-uncensored.ggccv1.q4_0.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
| Info | format | n_vocab | n_bpe | n_ctx | n_embd | n_head ; kv | n_layer | falcon | ftype | n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
| | ggcc v1 | 65024 | 64784 | 2048 | 4544 | 71 ; 1 | 32 | 7; 7B | 2 | 18176 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size = 0.00 MB (mmap size = 3872.00 MB)
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon_model_load_internal: mem required = 4196.81 MB (+ 48.00 MB per state)
[==================================================] 100% Tensors populated
falcon_context_prepare: Context falcon_main RAM buffers - key_val = 16.00 MB, Compute = 160.00 MB, Scratch 0 = 124.00 MB, Scratch 1 = 40.14 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/SanDisk/ggllm.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x14160b850
ggml_metal_init: loaded kernel_mul 0x14160bf70
ggml_metal_init: loaded kernel_mul_row 0x14160c5a0
ggml_metal_init: loaded kernel_scale 0x14160cac0
ggml_metal_init: loaded kernel_silu 0x14160cfe0
ggml_metal_init: loaded kernel_relu 0x14160d500
ggml_metal_init: loaded kernel_gelu 0x14160da20
ggml_metal_init: loaded kernel_soft_max 0x14160e0d0
ggml_metal_init: loaded kernel_diag_mask_inf 0x14160e730
ggml_metal_init: loaded kernel_get_rows_f16 0x14160edb0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x14160f430
ggml_metal_init: loaded kernel_get_rows_q4_1 0x14160fc20
ggml_metal_init: loaded kernel_get_rows_q2_k 0x1416102a0
ggml_metal_init: loaded kernel_get_rows_q3_k 0x141610920
ggml_metal_init: loaded kernel_get_rows_q4_k 0x141610fa0
ggml_metal_init: loaded kernel_get_rows_q5_k 0x141611620
ggml_metal_init: loaded kernel_get_rows_q6_k 0x141611ca0
ggml_metal_init: loaded kernel_rms_norm 0x141612350
ggml_metal_init: loaded kernel_norm 0x141612a00
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x1416133d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x141613ab0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x141614190
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x141614870
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32 0x1416150f0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x1416157d0
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32 0x141615eb0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x141616590
ggml_metal_init: loaded kernel_rope 0x141617080
ggml_metal_init: loaded kernel_alibi_f32 0x141617940
ggml_metal_init: loaded kernel_cpy_f32_f16 0x1416181d0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x141618a60
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1416192f0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3874.44 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 160.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 48.02 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 124.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 40.14 MB
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| 4/10 thrd | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
| Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
| | 64 | 1.100 | 0.000 | 0.000 | 40 | 1.000 | 0.950 | 1.000 | 0.80 | 0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation | Ctx | Batch | Keep | Prom. | Seed | Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
| | 2048 | 1 | 0 | 10 | 1692449979 | WIZARD | # 1 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
GGML_ASSERT: ggml-metal.m:530: ne02 == ne12
GGML_ASSERT: ggml-metal.m:530: ne02 == ne12
zsh: abort ./falcon_main -t 4 -ngl 100 -b 1 -m -enc -p "write a story about llamas"
Same issue here... I'll try converting the model to other quantization types.
ggml and llama.cpp support Metal; do Apple Silicon users need to use llama.cpp, or can they use ggllm.cpp with Falcon?