PABannier / bark.cpp

Suno AI's Bark model in C/C++ for fast text-to-speech
MIT License

What's the output length? #113

Open h3ndrik opened 9 months ago

h3ndrik commented 9 months ago

I think I remember reading that Bark generates 30 seconds of audio at a time. Is that also true for bark.cpp?

I tried having it read an article and it crashed. Is that a length limitation or something else?

Also: is there example code to make it read back a whole news article, a dialogue, or anything else useful?

PABannier commented 9 months ago

Hi! Can you give me the traceback from when bark.cpp crashed? I assume it's an OOM error that caused the kernel to kill the bark.cpp process. In particular, for 30-second audio, Encodec's memory requirements grow with the sequence length, which makes the computation intractable in practice for long audio. I'm currently refining the code to keep memory usage under control.
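
To make the scaling concrete, here is a rough back-of-envelope sketch (a hypothetical illustration, not bark.cpp code). In the GPT stages, plain self-attention materializes a seq_len x seq_len score matrix per head, so scratch memory grows quadratically with sequence length; Encodec's activation buffers likewise grow with the number of audio frames. The n_head value is taken from the gpt_model_load log below; the f32 score type is an assumption, and real allocators reuse buffers across layers, so treat this as an order-of-magnitude estimate only.

// Back-of-envelope only: attention-score memory per layer as a function of
// sequence length. n_head = 16 comes from the log in this thread; the f32
// element size is an assumption.
#include <cstdio>

int main() {
    const double n_head          = 16;  // from gpt_model_load below
    const double bytes_per_score = 4.0; // assuming f32 attention scores

    for (double seq_len : {256.0, 1024.0, 2088.0, 8192.0}) {
        // one (seq_len x seq_len) score matrix per head
        const double per_layer = n_head * seq_len * seq_len * bytes_per_score;
        std::printf("seq_len = %6.0f -> %7.1f MB of attention scores per layer\n",
                    seq_len, per_layer / (1024.0 * 1024.0));
    }
    return 0;
}

At the 2088 coarse tokens generated below, for instance, this is already roughly 266 MB of scratch per layer.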

h3ndrik commented 9 months ago
user@linux:~/tmp/bark.cpp$ ./build/bin/main -t 2 -o "./output2.wav" -p "Matcha is finely ground powder of specially grown and processed green tea leaves traditionally consumed in East Asia, which is mostly produced in Japan today. The green tea plants used for matcha are shade-grown for three to four weeks before harvest; the stems and veins are removed during processing. During shaded growth, the plant Camellia sinensis produces more theanine and caffeine. The powdered form of matcha is consumed differently from tea leaves or tea bags, as it is suspended in a liquid, typically water or milk."
bark_load_model_from_file: loading model from './ggml_weights'
bark_load_model_from_file: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1701.69 MB
bark_load_model_from_file: reading bark vocab

bark_load_model_from_file: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1250.69 MB

bark_load_model_from_file: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1218.26 MB

bark_load_model_from_file: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_load_model_from_file: total model size  =  4170.64 MB

bark_tokenize_input: prompt: 'Matcha is finely ground powder of specially grown and processed green tea leaves traditionally consumed in East Asia, which is mostly produced in Japan today. The green tea plants used for matcha are shade-grown for three to four weeks before harvest; the stems and veins are removed during processing. During shaded growth, the plant Camellia sinensis produces more theanine and caffeine. The powdered form of matcha is consumed differently from tea leaves or tea bags, as it is suspended in a liquid, typically water or milk.'
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 36199 20161 20172 23483 20502 26960 20562 72276 
bark_forward_text_encoder: ....................................................................

bark_print_statistics: mem per token =     4.80 MB
bark_print_statistics:   sample time =   172.53 ms / 696 tokens
bark_print_statistics:  predict time = 67473.14 ms / 96.81 ms per token
bark_print_statistics:    total time = 67666.94 ms

bark_forward_coarse_encoder: ....................................................................

bark_print_statistics: mem per token =   134.99 MB
bark_print_statistics:   sample time =    41.46 ms / 2088 tokens
bark_print_statistics:  predict time = 974489.06 ms / 466.48 ms per token
bark_print_statistics:    total time = 974548.31 ms

bark_forward_fine_encoder: ...........double free or corruption (!prev)
Aborted
PABannier commented 2 months ago

The OOM error should be fixed by #139. @h3ndrik Would you like to give it another try?

h3ndrik commented 2 months ago

Hmm, sorry, at the moment I'm unable to test this. It always says:

bark_load_model: failed to load model weights from './ggml_weights/'

and bark_load_model_from_file: invalid model file [...] (bad magic)

I guess I somehow broke my development environment and need to fix that first. Feel free to close this issue if appropriate.

Green-Sky commented 2 months ago

The model files changed and were merged into one. You can grab up-to-date files here: https://huggingface.co/Green-Sky/bark-ggml/tree/main

h3ndrik commented 2 months ago

You can grab up-to-date files [...]

Hmm. I strictly followed the "Prepare data & Run" process from the README.md; maybe I can debug that later...

I downloaded the updated model files. I have to specify the exact file with both -m and -em, or it either complains about (bad magic) again (when given just the directory) or fails to find the Encodec weights under the filename from the HF repo...

But I'm happy to confirm: now it works! It's excruciatingly slow on my machine; it took 10 minutes to turn that text into a 12s audio file. And the first half of the text is missing; it starts pretty much in the middle.

Edit: And this one fails:

 ./build/examples/main/main -t 2 -o "./output3.wav" -m ./ggml_weights/bark_weights-f16.bin -em ./ggml_weights/encodec_weights-f16.bin -p "Matcha is finely ground powder of specially grown and processed green tea leaves traditionally consumed in East Asia, which is mostly produced in Japan today. The green tea plants used for matcha are shade-grown for three to four weeks before harvest; the stems and veins are removed during processing. During shaded growth, the plant Camellia sinensis produces more theanine and caffeine. The powdered form of matcha is consumed differently from tea leaves or tea bags, as it is suspended in a liquid, typically water or milk. The traditional Japanese tea ceremony, typically known as chanoyu, centers on the preparation, serving and drinking of matcha as hot tea, and embodies a meditative spirituality. In modern times, matcha is also used to flavor and dye foods such as mochi and soba noodles, green tea ice cream, matcha lattes and a variety of Japanese wagashi confectionery."
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 36199 20161 20172 23483 20502 26960 20562 72276 

Generating semantic tokens: [=============================================>     ] (90%)

bark_print_statistics:   sample time =   134.36 ms / 696 tokens
bark_print_statistics:  predict time = 48305.85 ms / 69.40 ms per token
bark_print_statistics:    total time = 48465.52 ms

Generating coarse tokens: [==================================================>] (100%)

bark_print_statistics:   sample time =    45.64 ms / 2088 tokens 
bark_print_statistics:  predict time = 632271.56 ms / 302.81 ms per token
bark_print_statistics:    total time = 632347.75 ms

Generating fine tokens: [==================================================>] (100%)free(): invalid next size (normal)
Aborted

Edit 2: The longer prompt also fails with q8, and the shorter prompt also fails with f16 but works with q8.

PABannier commented 2 months ago

@h3ndrik Thanks for the feedback! bark.cpp now supports bark-small (#151). I greatly simplified the instructions. For a prompt this long, this is what I would do to speed up the process:

  1. Use bark-small with mixed precision (pass --use-f16 when running the convert.py script).
  2. Pass -t 8 when running the main script to use 8 threads (provided your machine has that many CPU threads).

I'd be super curious whether you can generate this sentence faster.
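
And for the earlier question about reading back a whole article: since a single generation only produces a short clip (above, a long prompt came out as 12s of audio with the first half missing), one workaround is to split the text into sentence-sized chunks and synthesize each chunk separately. Here is a minimal, hypothetical sketch of that idea (not part of bark.cpp); the binary path, flags and weight filenames are copied from the commands earlier in this thread, and the naive sentence splitter is only for illustration.

#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Naive sentence splitter: breaks on '.', '!' and '?'. Illustration only;
// it does not handle abbreviations, quotes or shell escaping.
static std::vector<std::string> split_sentences(const std::string & text) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : text) {
        cur += c;
        if (c == '.' || c == '!' || c == '?') {
            out.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

int main() {
    const std::string article =
        "Matcha is finely ground powder of specially grown and processed "
        "green tea leaves. The green tea plants used for matcha are "
        "shade-grown for three to four weeks before harvest.";

    int idx = 0;
    for (const std::string & sentence : split_sentences(article)) {
        // Invoke the bark.cpp main binary once per sentence, writing one
        // WAV file per chunk. Paths and flags match the commands above.
        std::ostringstream cmd;
        cmd << "./build/examples/main/main -t 8"
            << " -m ./ggml_weights/bark_weights-f16.bin"
            << " -em ./ggml_weights/encodec_weights-f16.bin"
            << " -o ./chunk_" << idx++ << ".wav"
            << " -p \"" << sentence << "\"";
        if (std::system(cmd.str().c_str()) != 0) return 1; // stop on failure
    }
    return 0;
}

The resulting chunk_*.wav files can then be concatenated with an external tool such as sox or ffmpeg.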