PABannier / bark.cpp

Suno AI's Bark model in C/C++ for fast text-to-speech
MIT License

Extremely slow inference on Google Colab. #179

crackedpotato007 closed this issue 1 month ago

crackedpotato007 commented 1 month ago

I am using the Google Colab demo on a CPU runtime and it's excruciatingly slow: it has been running for upwards of 20 minutes and the first step, semantic token generation, has only just reached 73%.

I'll attach the final output when it's done.

crackedpotato007 commented 1 month ago

```
 _                   _                           
| |                 | |                          
| |__    __ _  _ __ | | __     ___  _ __   _ __  
| '_ \  / _` || '__|| |/ /    / __|| '_ \ | '_ \ 
| |_) || (_| || |   |   <  _ | (__ | |_) || |_) |
|_.__/  \__,_||_|   |_|\_\(_) \___|| .__/ | .__/ 
                                   | |    | |    
                                   |_|    |_|    
encodec_load_model_weights: in_channels = 1
encodec_load_model_weights: hidden_dim  = 128
encodec_load_model_weights: n_filters   = 32
encodec_load_model_weights: kernel_size = 7
encodec_load_model_weights: res_kernel  = 3
encodec_load_model_weights: n_bins      = 1024
encodec_load_model_weights: bandwidth   = 24
encodec_load_model_weights: sample_rate = 24000
encodec_load_model_weights: ftype       = 1
encodec_load_model_weights: qntvr       = 0
encodec_load_model_weights: ggml tensor size    = 320 bytes
encodec_load_model_weights: backend buffer size =  54.36 MB
encodec_load_model_weights: using CPU backend
encodec_load_model_weights: model size =    44.36 MB
encodec_load_model: n_q = 32

bark_tokenize_input: prompt: 'this is an audio generated by bark.cpp'
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 58966 20203 28169 20222 

Generating semantic tokens... 26%

bark_print_statistics:   sample time =    47.48 ms / 205 tokens
bark_print_statistics:  predict time = 399150.69 ms / 1947.08 ms per token
bark_print_statistics:    total time = 399241.41 ms

Generating coarse tokens... 100%

bark_print_statistics:   sample time =    22.94 ms / 612 tokens
bark_print_statistics:  predict time = 1246901.88 ms / 2037.42 ms per token
bark_print_statistics:    total time = 1246956.25 ms

Generating fine tokens... 100%

bark_print_statistics:   sample time =   112.96 ms / 6144 tokens
bark_print_statistics:  predict time = 83667.94 ms / 13.62 ms per token
bark_print_statistics:    total time = 83828.00 ms

encodec_eval: compute buffer size: 103.34 MB

write_wav_on_disk: Number of frames written = 97920.

main:     load time =     0.00 ms
main:     eval time = 1789810.75 ms
main:    total time = 1790277.75 ms
```
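
For a sense of scale, the log above already contains everything needed to estimate a rough real-time factor: the WAV has 97920 frames at the 24000 Hz sample rate reported by encodec (about 4.1 s of audio), generated in roughly 1790 s of wall time. A minimal sketch of that arithmetic, using only figures copied from the log:

```cpp
#include <cstdio>

int main() {
    // Figures copied from the log above (Colab CPU runtime).
    const double n_frames    = 97920.0;     // write_wav_on_disk: frames written
    const double sample_rate = 24000.0;     // encodec_load_model_weights: sample_rate
    const double total_ms    = 1790277.75;  // main: total time

    const double audio_s = n_frames / sample_rate;  // ~4.08 s of generated audio
    const double wall_s  = total_ms / 1000.0;       // ~1790 s of wall-clock time
    printf("audio: %.2f s, wall clock: %.2f s, real-time factor: %.0fx\n",
           audio_s, wall_s, wall_s / audio_s);      // roughly 440x slower than real time
    return 0;
}
```

In other words, this run produced about 4 seconds of audio in about 30 minutes on the Colab CPU runtime.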
PABannier commented 1 month ago

Hello @crackedpotato007! Thanks for reaching out. I tried on Google Colab and it is indeed very slow. This is mostly because the machines provided by Colab are not very powerful and Bark is still very resource-intensive.

Can you send me the specs of the machine? Are you using the free plan of Google Colab?

I can only advise using Bark-small, which is a smaller model and should allow for faster inference.

crackedpotato007 commented 1 month ago

Hello, I tried it on my device with an i3-10110U, which is 2C/4T.

Not powerful by today's standards, but I did notice that changing the thread count from 4 to 1 cut the time from 29 ms per token to 18 ms per token for semantic token generation.
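
One way to make that kind of observation concrete is to time the same generation step at several thread counts and compare. Below is a minimal, self-contained sketch of such a harness; the dummy workload is only a placeholder standing in for the actual bark.cpp call (for example bark_generate_audio from bark.h, whose exact signature may differ between versions), not the library's API:

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Time one run of `generate` for each thread count and print the result.
// `generate` is a placeholder: in a real test it would forward n_threads
// to the bark.cpp generation call being measured.
static void benchmark_threads(const std::function<void(int)> &generate) {
    const int thread_counts[] = {1, 2, 4};
    for (int n_threads : thread_counts) {
        const auto t0 = std::chrono::steady_clock::now();
        generate(n_threads);
        const auto t1 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("n_threads = %d -> %.2f ms\n", n_threads, ms);
    }
}

int main() {
    // Dummy CPU-bound workload so the sketch compiles and runs on its own;
    // it ignores n_threads, unlike a real bark.cpp generation call would.
    benchmark_threads([](int /*n_threads*/) {
        volatile double acc = 0.0;
        for (long i = 1; i <= 20000000L; ++i) acc += 1.0 / static_cast<double>(i);
    });
    return 0;
}
```

Swapping the placeholder for the real call and running it at 1, 2, and 4 threads would give exactly the kind of numbers worth posting on the ggml repo.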

PABannier commented 1 month ago

@crackedpotato007 This is surprising. I'm closing this issue since it runs fast locally. Feel free to open a new one on this repo, or directly on the ggml repo, to share your observations regarding multi-threading.