PABannier / bark.cpp

Suno AI's Bark model in C/C++ for fast text-to-speech
MIT License
630 stars · 48 forks

Metal #143

Open · zodiac1214 opened this issue 2 months ago

zodiac1214 commented 2 months ago

I tried to build with Metal enabled:

cmake -DGGML_METAL=ON  ..
cmake --build . --config Release

but it still only uses the CPU instead of the Mac GPU.
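One quick way to tell whether the Metal backend actually came up at runtime is to look for `ggml_metal_init` lines in the program's output; a CPU-only build never prints them. A minimal sketch (the `LOG` variable is a hypothetical stand-in for the binary's combined stdout/stderr; the sample line is taken from the full output later in this thread):

```shell
# Sketch: classify a run log by whether the Metal backend initialized.
# LOG stands in for the real program output (hypothetical sample).
LOG="ggml_metal_init: GPU name: Apple M1 Ultra
bark_tokenize_input: prompt: Test"

if printf '%s\n' "$LOG" | grep -q '^ggml_metal_init:'; then
  echo "Metal backend initialized"
else
  echo "CPU-only run (no ggml_metal_init lines)"
fi
```

In a real run you would pipe the binary's output (with `2>&1`) into the same `grep` instead of using a canned variable.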

PABannier commented 2 months ago

Hello @zodiac1214 !

You are right: there is a mistake in the implementation that currently makes it impossible to run bark.cpp on Metal. I'll fix it in the next few days.

ochafik commented 2 months ago

FWIW, I tried to wire up the -ngl parameter here and hit a wall with:

ggml_metal_graph_compute_block_invoke: error: node 5, op = SET not implemented

(a ggml sync might help?)

git remote add ochafik https://github.com/ochafik/bark.cpp
git fetch ochafik
rm -fR build && \
  cmake -B build . -DGGML_METAL=1 -DCMAKE_BUILD_TYPE=Release && \
  cmake --build build && \
  cp build/bin/ggml-metal.metal build/encodec.cpp/ggml/src

./build/examples/main/main -m ./models/bark/ggml_weights.bin -p "Test" -t 4 -o out2.wav
Full output:

```
ggml_metal_init: loaded kernel_silu                   0x13ff44a50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu                   0x13ff44c80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu                   0x13ff44eb0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max               0x13ff450e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4             0x13ff45310 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf          0x13ff45540 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8        0x13ff45770 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32           0x13ff459a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16           0x13ff45bd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0          0x13ff45e00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1          0x13ff46030 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0          0x13ff46260 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K          0x13ff46490 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K          0x13ff466c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K          0x13ff468f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K          0x13ff46b20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K          0x13ff46d50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm               0x13ff46f80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm                   0x13ff471b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f32_f32         0x13ff473e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32         0x13ff47610 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_1row    0x13ff47840 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_l4      0x140125080 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_0_f32        0x1401252b0 | th_max =  896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_1_f32        0x1401254e0 | th_max =  896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q8_0_f32        0x140125710 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q2_K_f32        0x140125940 | th_max =  640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q3_K_f32        0x140125b70 | th_max =  576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_K_f32        0x140125da0 | th_max =  576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q5_K_f32        0x140125fd0 | th_max =  576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q6_K_f32        0x140126200 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32         0x140126430 | th_max =  768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32         0x140126660 | th_max =  768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32        0x140126890 | th_max =  704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32        0x140126ac0 | th_max =  768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32        0x140126cf0 | th_max =  704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32        0x140126f20 | th_max =  768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32        0x140127150 | th_max =  768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32        0x140127380 | th_max =  768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32        0x1401275b0 | th_max =  768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32        0x1401277e0 | th_max =  768 | th_width = 32
ggml_metal_init: loaded kernel_rope_f32               0x140127a10 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rope_f16               0x140127c40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32              0x140127e70 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16            0x1401280a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32            0x1401282d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16            0x140128500 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_concat                 0x140128730 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sqr                    0x140128960 | th_max = 1024 | th_width = 32
ggml_metal_init: GPU name:   Apple M1 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: recommendedMaxWorkingSetSize = 98304.00 MB
ggml_metal_init: maxTransferRate              = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 54.36 MB, ( 2561.59 / 98304.00)
encodec_load_model_weights: model size = 44.36 MB
encodec_load_model: n_q = 32
# bctx->n_gpu_layers = 99
bark_tokenize_input: prompt: 'I really love using llama.cpp and its ecosystem. They make me happy'
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 10194 40229 26186 23430 57329 10167 10219 26635
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 10.06 MB, ( 2571.66 / 98304.00)
ggml_metal_graph_compute_block_invoke: error: node 5, op = SET not implemented
GGML_ASSERT: /Users/ochafik/github/bark.cpp/encodec.cpp/ggml/src/ggml-metal.m:1428: false
```
PABannier commented 2 months ago

@ochafik Thanks for trying it!

Yes, we'll need to sync with the latest version of ggml. However, we'll also have to implement additional operations in ggml and write the corresponding Metal kernels (e.g. sigmoid, pad_reflec_1).