PABannier closed this pull request 11 months ago.
With this change it now runs, but the audio is garbled:
$ bin/main -m ../models/bark_v0/
bark_model_load: loading model from '../models/bark_v0/'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab
bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB
bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB
bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB
bark_model_load: total model size = 4170.64 MB
bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................
bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 20.04 ms
bark_forward_text_encoder: predict time = 8496.40 ms / 23.28 ms per token
bark_forward_text_encoder: total time = 8591.71 ms
bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................
bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 5.65 ms
bark_forward_coarse_encoder: predict time = 48689.18 ms / 150.28 ms per token
bark_forward_coarse_encoder: total time = 48761.99 ms
bark_forward_fine_encoder: .....
bark_forward_fine_encoder: mem per token = 0.33 MB
bark_forward_fine_encoder: sample time = 82.52 ms
bark_forward_fine_encoder: predict time = 4179.05 ms
bark_forward_fine_encoder: total time = 4264.82 ms
bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 1117.38 ms / 1117.38 ms per token
bark_forward_encodec: total time = 1263.33 ms
Number of frames written = 51840.
main: load time = 10230.52 ms
main: eval time = 62911.57 ms
main: total time = 73142.12 ms
but you can still make out the 'this is an audio' if you listen closely. I see this as an absolute win :D
@Green-Sky this is WIP for now :) will ping you when it's fixed
@PABannier it now runs for me :)
(timings are bad because the PC was in energy-saving mode)
$ bin/main -m ../models/bark_v0/
bark_model_load: loading model from '../models/bark_v0/'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab
bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB
bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB
bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB
bark_model_load: total model size = 4170.64 MB
bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................
bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 29.63 ms
bark_forward_text_encoder: predict time = 14591.61 ms / 39.98 ms per token
bark_forward_text_encoder: total time = 14754.56 ms
bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................
bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 9.39 ms
bark_forward_coarse_encoder: predict time = 94391.66 ms / 291.33 ms per token
bark_forward_coarse_encoder: total time = 94528.52 ms
bark_forward_fine_encoder: .....
bark_forward_fine_encoder: mem per token = 3.25 MB
bark_forward_fine_encoder: sample time = 116.07 ms
bark_forward_fine_encoder: predict time = 189043.48 ms
bark_forward_fine_encoder: total time = 189223.77 ms
bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 1607.51 ms / 1607.51 ms per token
bark_forward_encodec: total time = 1817.91 ms
Number of frames written = 51840.
main: load time = 9873.04 ms
main: eval time = 300360.00 ms
main: total time = 310233.09 ms
@Green-Sky Thanks for the update! I'd like to chat with you. I tried to reach you via Matrix, but it does not work ("Failed to fetch user"). Do you have an email?
> I tried to reach you via Matrix, but it does not work ("Failed to fetch user").
o.o lemme fix that, that is actually very concerning
@Green-Sky sure! tell me when it's fixed :)
it's worse than I thought, thank you very much for telling me :heart:
This PR fixes a bug in the fine encoder. More precisely, when inspecting the attention maps from our fine encoder and those from the original Bark implementation, we found a significant discrepancy hinting at an implementation error.
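The debugging approach described above (comparing attention maps across the two implementations) can be sketched as follows. The maps are assumed to have been dumped to arrays of shape `(n_head, seq_len, seq_len)` by each implementation; the function name and tolerance are illustrative, not part of the actual PR:

```python
import numpy as np

def compare_attention_maps(ours: np.ndarray, reference: np.ndarray,
                           atol: float = 1e-4) -> float:
    """Return the max absolute difference between two attention maps.

    Both inputs are expected to have shape (n_head, seq_len, seq_len);
    a shape mismatch is itself a strong sign of an implementation bug.
    """
    if ours.shape != reference.shape:
        raise ValueError(f"shape mismatch: {ours.shape} vs {reference.shape}")
    diff = float(np.abs(ours - reference).max())
    if diff > atol:
        print(f"discrepancy: max abs diff = {diff:.6f}")
    return diff

# Toy check: identical maps agree exactly.
maps = np.random.rand(16, 8, 8).astype(np.float32)
assert compare_attention_maps(maps, maps.copy()) == 0.0
```

A per-layer loop over such comparisons localizes the first layer where the two implementations diverge, which is typically where the bug lives.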