Open fann1993814 opened 9 months ago
Hi @fann1993814, thanks for implementing this!
I've tested this with the model RWKV-5-World 7B (RWKV-5-World-7B-v2-OnlyForTest_49%_trained-20231114-ctx4096-Q4_0): I converted it to ggml FP16 (and FP32 as well), then quantized to Q4_0.
Apple M2 16GB (8 cores)
I'm getting the following error:
rwkv.cpp % python python/chat_with_bot.py RWKV-5-World-7B-v2-OnlyForTest_49%_trained-20231114-ctx4096-Q4_0.bin
System info: AVX=0 AVX2=0 AVX512=0 FMA=0 NEON=1 ARM_FMA=1 F16C=0 FP16_VA=1 WASM_SIMD=0 BLAS=1 SSE3=0 VSX=0
Loading RWKV model
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: loading '/rwkv.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x15732b230 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x16083e7f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x16083e9e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x16083ebd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x16083edc0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x16083efb0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x16083f1a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x16083f390 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x16083f580 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x16083fb80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x160840160 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x1608408b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x160840ee0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x160841510 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x160841b40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x160842170 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x1608427a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x160842dd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x160843400 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x160843ba0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x1608441d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x160844800 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x160844e40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x157209fb0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f32_f32 0x157235030 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x160845500 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x160845e00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_l4 0x1608467e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x157f66db0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x157f675f0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x157f67e20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x160846da0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x160847360 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x160847a40 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x160848120 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x160848800 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x160848f90 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x160849720 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x160849eb0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x16084a640 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x16084add0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x16084b560 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x16084bcf0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x16084c480 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x16084cc10 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x16084d3a0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope 0x16084d990 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x16084e4e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x16084ecf0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x16084f500 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x16084fd10 | th_max = 1024 | th_width = 32
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'weight_data ' buffer, size = 4805.00 MB, ( 4805.50 / 10922.67)
ggml_metal_add_buffer: allocated 'serial_computer_buffer' buffer, size = 273.22 MB, ( 5078.72 / 10922.67)
Loading World v20230424 tokenizer
Processing 178 prompt tokens, may take a while
ggml_metal_add_buffer: allocated 'sequential_computer_buffer' buffer, size = 2896.44 MB, ( 7975.16 / 10922.67)
ggml_metal_graph_compute_block_invoke: node 2, op = REPEAT not implemented
GGML_ASSERT: /rwkv.cpp/ggml/src/ggml-metal.m:1265: false
ggml_metal_graph_compute_block_invoke: node 2285, op = MAP_CUSTOM1 not implemented
GGML_ASSERT: /rwkv.cpp/ggml/src/ggml-metal.m:1265: false
ggml_metal_graph_compute_block_invoke: node 4571, op = MAP_CUSTOM1 not implemented
GGML_ASSERT: /rwkv.cpp/ggml/src/ggml-metal.m:1265: false
ggml_metal_graph_compute_block_invoke: node 6855, op = MAP_CUSTOM1 not implemented
GGML_ASSERT: /rwkv.cpp/ggml/src/ggml-metal.m:1265: false
zsh: abort python python/chat_with_bot.py
Any idea?
@fann1993814
Hi @fdagostino, this PR is not ready, because I don't know how to implement some of the operations in Metal.
An "operation" here is a tensor computation such as ADD or SUB. Some of the operations are too difficult for me to implement (REPEAT and SET, and also the RWKV v5 wkv op).
Maybe we need to ask @saharNooby how to implement them...?
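For context on where the asserts in the log come from: ggml-metal.m walks the computation graph and dispatches one Metal kernel per ggml op in a big switch, and any op without a case falls through to GGML_ASSERT(false). A simplified sketch of that structure (illustrative, not the exact upstream code):

```c
#include <stdio.h>
#include "ggml.h"

// Simplified sketch of the dispatch in ggml-metal.m (illustrative, not the
// exact upstream code): every graph node must be mapped to a Metal kernel.
static void metal_graph_compute_sketch(struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; i++) {
        struct ggml_tensor * dst = gf->nodes[i];

        switch (dst->op) {
            case GGML_OP_ADD:
                // encode kernel_add / kernel_add_row ...
                break;
            case GGML_OP_MUL_MAT:
                // encode kernel_mul_mat_* / kernel_mul_mm_* ...
                break;
            // ... one case per op that has a Metal kernel ...
            default:
                // GGML_OP_REPEAT and GGML_OP_MAP_CUSTOM1 have no case,
                // which is exactly the "op = ... not implemented" assert
                // seen in the log above (ggml-metal.m:1265).
                fprintf(stderr, "%s: node %3d, op = %8s not implemented\n",
                        __func__, i, ggml_op_name(dst->op));
                GGML_ASSERT(false);
        }
    }
}
```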
Can't ggml just not use Metal for simple operations like ADD, SUB, etc. -- the same way it is for GPU offloading?
I'm afraid I would be able neither to implement, nor to help with implementing, operations for Metal -- among other reasons, I don't have a Mac.
The situation may change with the next ggml update in rwkv.cpp, but this is unlikely.
Hi @saharNooby
> Can't ggml just not use Metal for simple operations like ADD, SUB, etc. -- the same way it is for GPU offloading?
Metal needs kernel functions to be implemented for each operation; you can see those functions in src/ggml/ggml-metal.metal. It's different from CUDA (cuBLAS), because the cuBLAS path only covers BLAS computations (GEMM and the like), and everything else stays on the CPU. The Metal backend is not just BLAS: it is general-purpose compute that runs the whole graph on the GPU. So the graph's ggml operations have to be translated into Metal operations; otherwise, the Metal runtime cannot execute them.
Implementing some basic kernel functions is easy, for element-wise ops like SUB/DIV/SQRT or custom operations like SIGMOID and OneMinusX. However, REPEAT, SET, and the wkv op for v5 are difficult for me.
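To illustrate the easy case: an element-wise kernel in ggml-metal.metal is only a few lines. Here is a hypothetical kernel_one_minus_x in that style (the name and kernel are illustrative, not part of the upstream file):

```metal
// Hypothetical element-wise Metal kernel in the style of ggml-metal.metal;
// not part of the upstream file. One thread handles one tensor element.
kernel void kernel_one_minus_x(
        device const float * src,
        device       float * dst,
        uint tpig[[thread_position_in_grid]]) {
    dst[tpig] = 1.0f - src[tpig];
}
```

The hard ops are hard precisely because they are not simple per-element maps: REPEAT has to broadcast across mismatched shapes, and the v5 wkv op carries recurrent state across the sequence.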
I'd prefer not to modify ggml, but I have no other idea for dealing with this problem. Newer ggml supports CoreML; maybe we could use that instead of thinking about how to work around Metal.
> The situation may change with the next ggml update in rwkv.cpp, but this is unlikely.
I understand your concern, so I'm stopping here for the moment. This needs to be treated as a long-term effort.
rwkv_graph, rwkv_eval, rwkv_operators. (Because many ggml_ops cannot be computed in Metal, and I found that the max function cannot be rewritten in terms of other ggml functions that the Metal backend supports.)