Closed — wehos closed this issue 8 months ago
We're excited to share that we've been exploring various VQ solutions and have selected a promising option, although the related code isn't available yet. This solution runs at 20 tokens per second, significantly lower than DAC and other similar works based on HuBERT/Wav2Vec. We're looking forward to further advancements in VQ decoders that can enhance speech quality. Also, I've sent you a connection request on LinkedIn, and I've set up a Discord server so we can connect more easily. Here's the invite link: https://discord.gg/Es5qTB9BcN.
Thanks for the information! Happy to get on board with Discord!
Reopening the issue to keep the Discord invitation visible. Feel free to close it.
Thanks. Added the link to the README.
@wehos Are there any improvements on this problem now? What was the final resolution? Diffusion gets better results but is compute-expensive; is the VQGAN good enough in quality, even for zero-shot unseen speakers?
As far as I know, the VQGAN was sufficient. The quality depends on many factors, including the number and size of the codebooks and the number of codes (a.k.a. tokens) per second.
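For a rough sense of how those factors trade off: the effective bitrate of a quantizer is approximately num_codebooks × log2(codebook_size) × codes_per_second. A minimal sketch (the configurations below are hypothetical examples for comparison, not this project's actual settings):

```python
import math

def vq_bitrate(num_codebooks: int, codebook_size: int, codes_per_second: float) -> float:
    """Approximate bitrate (bits/s) of a vector-quantizer configuration."""
    return num_codebooks * math.log2(codebook_size) * codes_per_second

# Hypothetical configs: one large codebook at a low token rate
# vs. a DAC-like residual VQ with many codebooks at a high frame rate.
print(vq_bitrate(num_codebooks=1, codebook_size=2048, codes_per_second=20))  # 220.0 bits/s
print(vq_bitrate(num_codebooks=9, codebook_size=1024, codes_per_second=86))  # 7740.0 bits/s
```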
Thank you very much. Can you share those number configs?
Hi,
I'm a PhD candidate and researcher in Computer Science. I'm deeply impressed by your series of work. To be honest, I'm not an expert in audio processing; I read relevant papers in the field and came up with a naive question about this project: I believe "vector quantization" will become a major bottleneck for this framework in the future.
Is your feature request related to a problem? Please describe.
One major disadvantage of autoregressive models and diffusion models is their inference speed. This project adopts an autoregressive model as a backbone (i.e., LLaMA), which may lead to significant latency in real-time inference scenarios. Meanwhile, streaming is an important way to reduce latency and enhance user experience.
Here comes a question. As far as I know, the vector-quantization technique quantizes the latent representation of each token and leverages a transformer decoder to reconstruct it, conditioning on the representations from the whole sample. This means that while the GPT backbone generates semantic tokens, the VQ decoder cannot get to work until the generation of the whole sample has finished, which hinders the potential for streaming in the whole framework. Meanwhile, the latest multimodal foundation models (Sora from OpenAI, BASE TTS from Amazon) also rely on additional encoders and decoders, but not necessarily a VQ model (Sora used a VAE, BASE TTS used a CNN decoder).
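To make the streaming concern concrete, here is a toy sketch (plain Python, all names hypothetical) contrasting a decoder that needs the full token sequence with a chunked one that can emit audio while the backbone is still generating:

```python
from typing import Iterable, Iterator, List

def generate_tokens(n: int) -> Iterator[int]:
    """Stand-in for the autoregressive backbone emitting semantic tokens."""
    yield from range(n)

def full_context_decode(tokens: List[int]) -> List[float]:
    """A decoder with full (bidirectional) attention: it needs every token
    before it can reconstruct anything, so audio appears only after the
    whole generation has finished."""
    return [float(t) for t in tokens]  # placeholder for actual reconstruction

def chunked_decode(token_stream: Iterable[int], chunk: int = 4) -> Iterator[List[float]]:
    """A causal/chunked decoder: it reconstructs audio per chunk, so playback
    can begin while the backbone is still generating."""
    buf: List[int] = []
    for t in token_stream:
        buf.append(t)
        if len(buf) == chunk:
            yield [float(x) for x in buf]  # placeholder for actual reconstruction
            buf = []
    if buf:
        yield [float(x) for x in buf]

# Non-streaming: first audio is available only after all 12 tokens exist.
audio = full_context_decode(list(generate_tokens(12)))

# Streaming: an audio chunk is available every 4 tokens.
for piece in chunked_decode(generate_tokens(12)):
    print(piece)
```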
Describe the solution you'd like
Replacing the VQ-based decoder with an end-to-end convolutional decoder. Please refer to this paper.
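For reference, a minimal hypothetical sketch in PyTorch of what such an upsampling convolutional decoder could look like (layer sizes are illustrative, not a tested recipe; a true streaming variant would additionally need causal padding):

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Toy convolutional decoder: upsamples latent frames to waveform samples
    with transposed convolutions. Illustrative only, not this project's model."""

    def __init__(self, latent_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, hidden, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(hidden, hidden // 2, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(hidden // 2, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),  # waveform in [-1, 1]
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, latent_dim, frames) -> waveform: (batch, 1, frames * 64)
        return self.net(latents)

decoder = ConvDecoder()
wave = decoder(torch.randn(2, 256, 50))  # 50 latent frames -> 3200 samples
print(wave.shape)  # torch.Size([2, 1, 3200])
```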
Describe alternatives you've considered
Replacing the VQGAN with a non-VQ pre-trained encoder/decoder.
Additional Context
Overall, I highly recommend the BASE TTS paper from Amazon. Meanwhile, I am very interested in, and happy to join, the development of this project. If you would like to connect, please reach me through any of the contacts available in my GitHub profile. Thank you.
Best, Hongzhi