PABannier / bark.cpp

Suno AI's Bark model in C/C++ for fast text-to-speech
MIT License

FIX Bug in fine encoder #74

Closed · PABannier closed this 11 months ago

PABannier commented 11 months ago

This PR fixes a bug in the fine encoder. More precisely, when comparing the attention maps from our fine encoder with those from the original Bark implementation, we found a significant discrepancy hinting at an implementation error.
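The kind of comparison described above can be sketched as follows. This is an illustrative NumPy snippet, not code from this PR; `max_attention_discrepancy` is a hypothetical helper, and the toy arrays merely stand in for attention maps dumped from the two implementations.

```python
import numpy as np

def max_attention_discrepancy(ours, reference):
    """Largest absolute difference between two stacks of attention
    maps, e.g. shape (n_layer, n_head, T, T)."""
    ours = np.asarray(ours, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    assert ours.shape == reference.shape
    return float(np.max(np.abs(ours - reference)))

# Toy check: identical maps agree exactly; a corrupted map does not.
rng = np.random.default_rng(0)
ref = rng.random((2, 4, 8, 8))
ok = max_attention_discrepancy(ref, ref.copy())    # 0.0
bad = max_attention_discrepancy(ref, ref + 0.5)    # 0.5
print(ok, bad)
```

A small, near-zero maximum discrepancy (up to float tolerance) is what one would expect from a faithful port; the large gap reported here is what pointed at the implementation error.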

Green-Sky commented 11 months ago

With this change it now runs, but the audio is garbled.

$ bin/main -m ../models/bark_v0/
bark_model_load: loading model from '../models/bark_v0/'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_model_load: total model size  =  4170.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token =     4.80 MB
bark_forward_text_encoder:   sample time =    20.04 ms
bark_forward_text_encoder:  predict time =  8496.40 ms / 23.28 ms per token
bark_forward_text_encoder:    total time =  8591.71 ms

bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

bark_forward_coarse_encoder: mem per token =     8.51 MB
bark_forward_coarse_encoder:   sample time =     5.65 ms
bark_forward_coarse_encoder:  predict time = 48689.18 ms / 150.28 ms per token
bark_forward_coarse_encoder:    total time = 48761.99 ms

bark_forward_fine_encoder: .....

bark_forward_fine_encoder: mem per token =     0.33 MB
bark_forward_fine_encoder:   sample time =    82.52 ms
bark_forward_fine_encoder:  predict time =  4179.05 ms
bark_forward_fine_encoder:    total time =  4264.82 ms

bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec:  predict time =  1117.38 ms / 1117.38 ms per token
bark_forward_encodec:    total time =  1263.33 ms

Number of frames written = 51840.

main:     load time = 10230.52 ms
main:     eval time = 62911.57 ms
main:    total time = 73142.12 ms

output.zip

But you can still make out the 'this is an audio' if you listen closely. I see this as an absolute win :D
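As a sanity check on the log above: the four forward stages (text encoder, coarse encoder, fine encoder, EnCodec) account for nearly all of the reported eval time. A quick sum, with the `total time` values copied from the first run, purely for illustration:

```python
# Per-stage totals from the first log, in milliseconds.
stage_totals_ms = {
    "text_encoder":   8591.71,
    "coarse_encoder": 48761.99,
    "fine_encoder":   4264.82,
    "encodec":        1263.33,
}

total = sum(stage_totals_ms.values())
# ~62881.85 ms, vs. main's reported eval time of 62911.57 ms;
# the small remainder is overhead outside the four stages.
print(f"{total:.2f}")
```

The coarse encoder dominates here, which matches its much longer progress bar in the log.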

PABannier commented 11 months ago

@Green-Sky this is a WIP for now :) I'll ping you when it's fixed

Green-Sky commented 11 months ago

@PABannier it now runs for me :)

(timings are bad because the PC was in energy-saving mode)

$ bin/main -m ../models/bark_v0/
bark_model_load: loading model from '../models/bark_v0/'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_model_load: total model size  =  4170.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token =     4.80 MB
bark_forward_text_encoder:   sample time =    29.63 ms
bark_forward_text_encoder:  predict time = 14591.61 ms / 39.98 ms per token
bark_forward_text_encoder:    total time = 14754.56 ms

bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

bark_forward_coarse_encoder: mem per token =     8.51 MB
bark_forward_coarse_encoder:   sample time =     9.39 ms
bark_forward_coarse_encoder:  predict time = 94391.66 ms / 291.33 ms per token
bark_forward_coarse_encoder:    total time = 94528.52 ms

bark_forward_fine_encoder: .....

bark_forward_fine_encoder: mem per token =     3.25 MB
bark_forward_fine_encoder:   sample time =   116.07 ms
bark_forward_fine_encoder:  predict time = 189043.48 ms
bark_forward_fine_encoder:    total time = 189223.77 ms

bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec:  predict time =  1607.51 ms / 1607.51 ms per token
bark_forward_encodec:    total time =  1817.91 ms

Number of frames written = 51840.

main:     load time =  9873.04 ms
main:     eval time = 300360.00 ms
main:    total time = 310233.09 ms

output.zip

PABannier commented 11 months ago

@Green-Sky Thanks for the update! I'd like to chat with you. I tried to reach you via Matrix but it does not work (Failed to fetch user). Do you have an email?

Green-Sky commented 11 months ago

> I tried to reach via matrix but it does not work (Failed to fetch user).

o.o lemme fix that, that is actually very concerning

PABannier commented 11 months ago

@Green-Sky sure! tell me when it's fixed :)

Green-Sky commented 11 months ago

It's worse than I thought, thank you very much for telling me :heart: