PABannier / bark.cpp

Suno AI's Bark model in C/C++ for fast text-to-speech
MIT License

Quantize doesn't seem to work for codec model #88

Closed: jzeiber closed this issue 5 months ago

jzeiber commented 1 year ago

The text, coarse, and fine models convert successfully, but quantizing the codec model always produces a 0-byte output. After a quick look, it seems the codec model's header is laid out slightly differently from the other models', so the ftype is read from the wrong offset in the file.
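For readers unfamiliar with the file layout: a ggml-style loader reads a fixed-size header before the tensors, so a single extra or missing header field shifts every subsequent read. The sketch below is illustrative only; the hyperparameter count and ftype convention are assumptions, not bark.cpp's actual loader code.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative sketch, not bark.cpp's actual loader: a ggml-style file
// begins with a magic number, a block of hyperparameters, then an ftype
// flag. If one model stores a different number of header fields, every
// later read is misaligned and ftype comes back as garbage, so the
// quantizer bails out and writes an empty file.
bool read_header(FILE * f) {
    uint32_t magic = 0, ftype = 0;
    int32_t  hparams[6]; // hypothetical hyperparameter count

    if (fread(&magic, sizeof(magic), 1, f) != 1) return false;
    if (magic != 0x67676d6c) return false; // "ggml"

    // If the codec model's header holds more or fewer fields,
    // this read is shifted...
    if (fread(hparams, sizeof(hparams), 1, f) != 1) return false;

    // ...and ftype is then read from the wrong offset.
    if (fread(&ftype, sizeof(ftype), 1, f) != 1) return false;
    return ftype <= 1; // 0 = f32, 1 = f16 in this sketch
}
```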

Additionally, running the models as f32 or f16 produces very similar output for the same prompt/seed. Running the text, coarse, and fine models quantized at q8_0 produces an entirely different output for the same prompt/seed.
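One likely reason the q8_0 output diverges entirely even with the same seed: quantization perturbs the logits slightly, a perturbed logit can flip which token gets picked at some step, and every later token conditions on that choice, so the sequences drift apart. A toy illustration of the flip (greedy picking used for simplicity; seeded sampling shows the same effect when a probability shifts across the random threshold):

```cpp
#include <cstdio>

// Index of the largest of three logits.
static int argmax3(const float * v) {
    return v[1] > v[0] ? (v[1] > v[2] ? 1 : 2)
                       : (v[0] > v[2] ? 0 : 2);
}

int main() {
    // Two near-tied logits: a ~1% quantization error swaps their order.
    const float fp16_logits[3] = {1.00f, 0.99f, 0.50f};
    const float q8_logits[3]   = {0.98f, 1.01f, 0.50f};

    // Different token at this step -> all subsequent tokens diverge.
    printf("fp16 picks token %d, q8_0 picks token %d\n",
           argmax3(fp16_logits), argmax3(q8_logits));
    return 0;
}
```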

PABannier commented 1 year ago

Hi @jzeiber! You're right, it's not working for the codec model. I made a mistake in the README by including a command to quantize the codec model. Some functions in GGML do not yet support quantized tensors. Fortunately, the codec part is fast compared to the forward passes of the GPT encoders, which need to be run several hundred times to generate tokens.
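In practice this means a quantize driver has to skip the codec tensors and leave them in f16/f32. A minimal sketch of that selection logic, with hypothetical tensor names and conventions (not bark.cpp's actual code):

```cpp
#include <string>
#include <vector>

// Hypothetical per-tensor metadata a quantize driver might carry.
struct tensor_meta {
    std::string name;   // e.g. "text_model/layer_0/attn_w" (illustrative)
    int         n_dims; // quantization targets 2-D weight matrices
};

// Quantize only tensors whose consuming ops accept quantized data.
static bool should_quantize(const tensor_meta & t) {
    // 1-D tensors (biases, norms) are conventionally kept in f32.
    if (t.n_dims != 2) return false;
    // Hypothetical convention: codec (Encodec) tensors carry a
    // "codec/" prefix and feed ops without quantized support.
    if (t.name.rfind("codec/", 0) == 0) return false;
    return true;
}

std::vector<std::string> select_tensors(const std::vector<tensor_meta> & ts) {
    std::vector<std::string> out;
    for (const auto & t : ts) {
        if (should_quantize(t)) out.push_back(t.name);
    }
    return out;
}
```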

I'm changing the README file.

SammyBravo commented 1 year ago

Can we include some inference times in the README for the quantized versions?

BarfingLemurs commented 1 year ago

@SammyBravo Currently I experience a slowdown using q4 compared to running in fp16. Additionally, it seems the same amount of memory is required during inference.
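For anyone reproducing this comparison, a simple wall-clock harness is enough to time the two builds side by side. The generate() callback below is a placeholder; the actual bark.cpp API names are not shown in this thread, so none are assumed here.

```cpp
#include <chrono>
#include <cstdio>

// Time a single call in milliseconds; pass the model invocation
// as a callable.
template <typename F>
double time_ms(F && generate) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    generate();
    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // Placeholder body: run the f16 or q4 inference call here.
    const double ms = time_ms([] { /* model call */ });
    printf("inference: %.1f ms\n", ms);
    return 0;
}
```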

PABannier commented 1 year ago

@BarfingLemurs This is not what I am observing. I'm currently running a set of benchmarks, and I'll update the README with detailed instructions for quantization (cc @SammyBravo). What are the specs of your machine? Which prompt did you use? How much memory is required for the prompt you used?
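One portable way to answer the memory question is to report peak resident set size right after generation. getrusage is standard POSIX; note that ru_maxrss is in kilobytes on Linux but bytes on macOS.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Print the process's peak resident set size so far.
void print_peak_rss() {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        // Units: kilobytes on Linux, bytes on macOS.
        printf("peak RSS: %ld\n", ru.ru_maxrss);
    }
}
```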

BarfingLemurs commented 1 year ago

@PABannier Sorry, I misreported.