PABannier / bark.cpp

Suno AI's Bark model in C/C++ for fast text-to-speech
MIT License

bark_forward_fine_encoder tried to allocate 30GB of memory during forward pass. #97

Closed: qnixsynapse closed this issue 1 year ago

qnixsynapse commented 1 year ago

As explained earlier in one of the issues, during a forward pass bark_forward_fine_encoder tries to allocate 30 GB of memory.

The console log looks something like this:

./main -m ./ggml_weights -p "this is an audio" 
bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_model_load: total model size  =  4170.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595 
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token =     4.80 MB
bark_forward_text_encoder:   sample time =    13.86 ms
bark_forward_text_encoder:  predict time =  6651.94 ms / 18.22 ms per token
bark_forward_text_encoder:    total time =  6737.75 ms

bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

bark_forward_coarse_encoder: mem per token =     8.51 MB
bark_forward_coarse_encoder:   sample time =     3.54 ms
bark_forward_coarse_encoder:  predict time = 31155.62 ms / 96.16 ms per token
bark_forward_coarse_encoder:    total time = 31228.26 ms

fine_gpt_eval: failed to allocate 31987885670 bytes
bark_forward_fine_encoder: ggml_aligned_malloc: insufficient memory (attempted to allocate 30506.03 MB)
GGML_ASSERT: ggml.c:4408: ctx->mem_buffer != NULL
zsh: killed     ./main -m ./ggml_weights -p "this is an audio"

So far I have been unable to track down the cause, but I will keep trying.

qnixsynapse commented 1 year ago

The console logs with q4_0 weights:

./main -t 1 -m quantized_weights/ -p "this is an audio"
bark_model_load: loading model from 'quantized_weights/'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 2002
gpt_model_load: qntvr       = 2
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 436.08 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =   242.90 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 2002
gpt_model_load: qntvr       = 2
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 372.66 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =   179.48 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ftype       = 2002
gpt_model_load: qntvr       = 2
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 368.07 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =   175.08 MB

bark_model_load: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_model_load: total model size  =   597.47 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595 
bark_forward_text_encoder: .............................................................................................................................................

bark_forward_text_encoder: mem per token =     4.80 MB
bark_forward_text_encoder:   sample time =    10.56 ms
bark_forward_text_encoder:  predict time =  6200.19 ms / 15.54 ms per token
bark_forward_text_encoder:    total time =  6278.56 ms

bark_forward_coarse_encoder: .........................................................................................................................................................................................................................................................................................................................................................................................................................................

bark_forward_coarse_encoder: mem per token =    10.31 MB
bark_forward_coarse_encoder:   sample time =     4.15 ms
bark_forward_coarse_encoder:  predict time = 46163.62 ms / 108.37 ms per token
bark_forward_coarse_encoder:    total time = 46232.75 ms

fine_gpt_eval: failed to allocate 32030667571 bytes
bark_forward_fine_encoder: ggml_aligned_malloc: insufficient memory (attempted to allocate 30546.83 MB)
GGML_ASSERT: ggml.c:4408: ctx->mem_buffer != NULL
zsh: killed     ./main -t 1 -m quantized_weights/ -p "this is an audio"

I observed that even with quantized weights the coarse encoder takes some time, and then the fine encoder still tries to allocate a very large amount of memory.

PABannier commented 1 year ago

Thanks for spotting this bug @akarshanbiswas! I think I found my mistake: here I should divide by n_codes, which would reduce the required memory by a factor of 8. Opening a PR to fix this issue.

qnixsynapse commented 1 year ago

Awesome. You're very welcome!

PABannier commented 1 year ago

Closed with https://github.com/PABannier/bark.cpp/pull/99 .

qnixsynapse commented 1 year ago

Tested and it works now!!! Thank you so much!

qnixsynapse commented 1 year ago

Also, a question: why does the memory allocated by the fine encoder depend on the sequence length of the input?

I just got an OOM on my 16 GB PC simply by increasing the length of the text input.

@PABannier