Open fann1993814 opened 9 months ago
Hi @fann1993814, thanks for implementing this!
I've tested this with the model RWKV-5-World 7B (RWKV-5-World-7B-v2-OnlyForTest_49%_trained-20231114-ctx4096-Q4_0): I converted it to ggml FP16 (and FP32 as well), then quantized to Q4_0.
Apple M2 16GB (8 cores)
I'm getting the following error:
rwkv.cpp % python python/chat_with_bot.py RWKV-5-World-7B-v2-OnlyForTest_49%_trained-20231114-ctx4096-Q4_0.bin
System info: AVX=0 AVX2=0 AVX512=0 FMA=0 NEON=1 ARM_FMA=1 F16C=0 FP16_VA=1 WASM_SIMD=0 BLAS=1 SSE3=0 VSX=0
Loading RWKV model
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: loading '/rwkv.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x15732b230 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x16083e7f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x16083e9e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x16083ebd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x16083edc0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x16083efb0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x16083f1a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x16083f390 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x16083f580 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x16083fb80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x160840160 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x1608408b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x160840ee0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x160841510 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x160841b40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x160842170 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x1608427a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x160842dd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x160843400 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x160843ba0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x1608441d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x160844800 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x160844e40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x157209fb0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f32_f32 0x157235030 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x160845500 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x160845e00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_l4 0x1608467e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x157f66db0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x157f675f0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x157f67e20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x160846da0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x160847360 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x160847a40 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x160848120 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x160848800 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x160848f90 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x160849720 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x160849eb0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x16084a640 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x16084add0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x16084b560 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x16084bcf0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x16084c480 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x16084cc10 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x16084d3a0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope 0x16084d990 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x16084e4e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x16084ecf0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x16084f500 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x16084fd10 | th_max = 1024 | th_width = 32
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'weight_data ' buffer, size = 4805.00 MB, ( 4805.50 / 10922.67)
ggml_metal_add_buffer: allocated 'serial_computer_buffer' buffer, size = 273.22 MB, ( 5078.72 / 10922.67)
Loading World v20230424 tokenizer
Processing 178 prompt tokens, may take a while
ggml_metal_add_buffer: allocated 'sequential_computer_buffer' buffer, size = 2896.44 MB, ( 7975.16 / 10922.67)
ggml_metal_graph_compute_block_invoke: node 2, op = REPEAT not implemented
GGML_ASSERT: /rwkv.cpp/ggml/src/ggml-metal.m:1265: false
ggml_metal_graph_compute_block_invoke: node 2285, op = MAP_CUSTOM1 not implemented
GGML_ASSERT: /rwkv.cpp/ggml/src/ggml-metal.m:1265: false
ggml_metal_graph_compute_block_invoke: node 4571, op = MAP_CUSTOM1 not implemented
GGML_ASSERT: /rwkv.cpp/ggml/src/ggml-metal.m:1265: false
ggml_metal_graph_compute_block_invoke: node 6855, op = MAP_CUSTOM1 not implemented
GGML_ASSERT: /rwkv.cpp/ggml/src/ggml-metal.m:1265: false
zsh: abort python python/chat_with_bot.py
Any idea?
@fann1993814
Hi @fdagostino, this PR is not ready, because I don't know how to implement some of the operations in Metal.
An "operation" here is a tensor computation such as ADD or SUB. Some of the operations are too difficult for me to implement (REPEAT and SET, and also the RWKV v5 wkv op).
Maybe we need to ask @saharNooby how to implement them...?
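For context on where the asserts in the log come from: ggml-metal.m walks the computation graph and dispatches one Metal kernel per ggml op in a big switch, and any op without a case falls through to GGML_ASSERT(false). A simplified sketch of that structure (illustrative, not the exact upstream code):

```c
#include <stdio.h>
#include "ggml.h"

// Simplified sketch of the dispatch in ggml-metal.m (illustrative, not the
// exact upstream code): every graph node must be mapped to a Metal kernel.
static void metal_graph_compute_sketch(struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; i++) {
        struct ggml_tensor * dst = gf->nodes[i];

        switch (dst->op) {
            case GGML_OP_ADD:
                // encode kernel_add / kernel_add_row ...
                break;
            case GGML_OP_MUL_MAT:
                // encode kernel_mul_mat_* / kernel_mul_mm_* ...
                break;
            // ... one case per op that has a Metal kernel ...
            default:
                // GGML_OP_REPEAT and GGML_OP_MAP_CUSTOM1 have no case,
                // which is exactly the "op = ... not implemented" assert
                // seen in the log above (ggml-metal.m:1265).
                fprintf(stderr, "%s: node %3d, op = %8s not implemented\n",
                        __func__, i, ggml_op_name(dst->op));
                GGML_ASSERT(false);
        }
    }
}
```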
Can't ggml just not use Metal for simple operations like ADD, SUB, etc. -- the same way it is for GPU offloading?
I'm afraid I would be able neither to implement, nor to help with implementing, operations for Metal -- among other reasons, I don't have a Mac.
The situation may change with the next ggml update in rwkv.cpp, but this is unlikely.
Hi @saharNooby
> Can't ggml just not use Metal for simple operations like ADD, SUB, etc. -- the same way it is for GPU offloading?
Metal needs kernel functions to be implemented for each operation; you can see those functions in src/ggml/ggml-metal.metal. It's different from CUDA (cuBLAS), because the cuBLAS path only covers BLAS computations (GEMM and the like), and everything else stays on the CPU. The Metal backend is not just BLAS: it is general-purpose compute that runs the whole graph on the GPU. So the graph's ggml operations have to be translated into Metal operations; otherwise, the Metal runtime cannot execute them.
Implementing some basic kernel functions is easy, for element-wise ops like SUB/DIV/SQRT or custom operations like SIGMOID and OneMinusX. However, REPEAT, SET, and the wkv op for v5 are difficult for me.
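To illustrate the easy case: an element-wise kernel in ggml-metal.metal is only a few lines. Here is a hypothetical kernel_one_minus_x in that style (the name and kernel are illustrative, not part of the upstream file):

```metal
// Hypothetical element-wise Metal kernel in the style of ggml-metal.metal;
// not part of the upstream file. One thread handles one tensor element.
kernel void kernel_one_minus_x(
        device const float * src,
        device       float * dst,
        uint tpig[[thread_position_in_grid]]) {
    dst[tpig] = 1.0f - src[tpig];
}
```

The hard ops are hard precisely because they are not simple per-element maps: REPEAT has to broadcast across mismatched shapes, and the v5 wkv op carries recurrent state across the sequence.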
I'd prefer not to modify ggml, but I have no other idea for dealing with this problem. Newer ggml supports CoreML; maybe we could use that instead of thinking about how to work around Metal.
> The situation may change with the next ggml update in rwkv.cpp, but this is unlikely.
I understand your concern, so I'm stopping here for the moment. This needs to be treated as a long-term effort.
rwkv_graph, rwkv_eval, rwkv_operators. (Because many ggml_ops cannot be computed in Metal, and I found that the max function cannot be rewritten in terms of other ggml functions that the Metal backend supports.)