FunAudioLLM / CosyVoice

Multilingual large voice generation model, providing full-stack support for inference, training, and deployment.
https://funaudiollm.github.io/
Apache License 2.0

speech_tokenizer #121

Closed UestcJay closed 1 month ago

UestcJay commented 1 month ago

The speech_tokenizer in the open-source model is distributed as an ONNX file. If I want to train a speech_tokenizer myself, how should I proceed? What can I refer to? Roughly how much data is needed?

aluminumbox commented 1 month ago

maybe @ZhihaoDU can answer this?

ZhihaoDU commented 1 month ago

> The speech_tokenizer in the open-source model is distributed as an ONNX file. If I want to train a speech_tokenizer myself, how should I proceed? What can I refer to? Roughly how much data is needed?

I recommend reading our paper first, which you can find at https://arxiv.org/abs/2407.05407. As for the data scale, you can use all the same-language data you have on hand, since the S3 tokenizer is trained in a supervised manner.
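For intuition, the tokenization step itself amounts to snapping each encoder frame to its nearest codebook vector and emitting that index as the speech token. A minimal NumPy sketch of that lookup (the dimensions and codebook size here are illustrative, not those of the actual S3 tokenizer):

```python
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each encoder frame to the index of its nearest codebook vector.

    frames:   (T, D) encoder outputs for one utterance
    codebook: (K, D) learned VQ codebook
    returns:  (T,)   discrete speech token ids
    """
    # Squared Euclidean distance between every frame and every code vector.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 128))  # e.g. 4096 codes, 128-dim features
frames = rng.normal(size=(50, 128))      # 50 frames from the encoder
tokens = quantize(frames, codebook)      # one integer token per frame
```

"Supervised" here means the codebook is learned inside a recognition model trained on labeled speech, rather than with a self-supervised reconstruction objective.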

UestcJay commented 1 month ago

Thank you for your reply! The paper does not seem to use a residual vector quantizer (RVQ), only a single VQ. What is the reason for this design?

ZhihaoDU commented 1 month ago

> Thank you for your reply! The paper does not seem to use a residual vector quantizer (RVQ), only a single VQ. What is the reason for this design?

A single VQ is more friendly to an LLM with only one prediction head. Unlike UniAudio, MusicGen, and other RVQ-based codec models, the modeling and prediction process of the LM in CosyVoice is much simpler. As you can see, even with a single VQ, CosyVoice works very well, generating natural, high-quality speech.
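The contrast can be made concrete with a small sketch: an RVQ quantizes the residual repeatedly, producing one token stream per stage, so the LM must predict several parallel streams (via multiple heads or interleaving patterns), whereas a single VQ yields one stream per frame. Sizes below are illustrative, not the real models':

```python
import numpy as np

def rvq(frames: np.ndarray, codebooks: list) -> np.ndarray:
    """Residual VQ: quantize, subtract, repeat.

    frames:    (T, D) feature frames
    codebooks: list of (K, D) arrays, one per RVQ stage
    returns:   (T, n_stages) token ids -- one stream per stage
    """
    residual = frames.copy()
    ids = []
    for cb in codebooks:
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        ids.append(idx)
        residual = residual - cb[idx]  # next stage models what is left over
    return np.stack(ids, axis=1)

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 16))
stages = [rng.normal(size=(256, 16)) for _ in range(4)]
rvq_tokens = rvq(frames, stages)      # (10, 4): four streams to predict
vq_tokens = rvq(frames, stages[:1])   # (10, 1): one stream, as in CosyVoice
```

With four streams, the LM either needs four prediction heads or a flattened/delayed interleaving scheme; with one stream, next-token prediction is exactly the standard LM setup.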

UestcJay commented 1 month ago

Thank you very much for your reply. I have another question. You are also an author of LauraGPT, and the technical routes of the two seem quite similar. However, the LauraGPT paper claims that continuous input is better than discrete input. Why does CosyVoice not continue with that approach, and instead discretize the input? I recently read the paper and am especially interested in this part.

ZhihaoDU commented 1 month ago

I believe your premise is not quite right. In LauraGPT, what we claimed is that for understanding tasks, continuous representations achieve better results than discrete ones. For generation tasks, LauraGPT also uses RVQ-based discrete acoustic tokens. In CosyVoice, we use supervised semantic tokens because they carry relatively strong speech information, so a single-codebook VQ is sufficient for the language model.

UestcJay commented 1 month ago

(two screenshots attached) The audio input of LauraGPT is continuous, and the speech tokens output by the LLM go through RVQ, right? My point is that continuous representations are better than discrete ones for audio inputs, so why can't CosyVoice use continuous representations for its audio inputs?

ZhihaoDU commented 1 month ago

OK, I see your question now. As I mentioned in my last reply, ASR, S2TT, and SE can be treated as understanding tasks, so we use continuous features as inputs. As you know, CosyVoice is a TTS model. For the TTS task, LauraGPT also uses discrete tokens (the first RVQ group). In LauraGPT, a non-autoregressive model then recovers the remaining token groups of the RVQ, while in CosyVoice, flow matching takes the role of the non-autoregressive model and reconstructs the Mel spectrogram. So, in my opinion, CosyVoice follows the same philosophy as LauraGPT, but upgrades the token type and the non-autoregressive model.
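The flow-matching stage described above can be sketched through its training objective: sample a point on the straight path between noise and the target Mel frame, and regress the constant velocity along that path. A minimal NumPy sketch of one conditional flow matching training pair (the real model additionally conditions the velocity network on the speech tokens and speaker embedding):

```python
import numpy as np

def fm_training_pair(x0: np.ndarray, x1: np.ndarray, t: float):
    """One conditional flow matching training example.

    x0: noise sample; x1: target Mel frame; t in [0, 1].
    Returns the interpolated input x_t and the regression target u,
    the constant velocity of the straight path from x0 to x1.
    """
    x_t = (1.0 - t) * x0 + t * x1  # point on the noise -> data path
    u = x1 - x0                    # velocity target along that path
    return x_t, u

rng = np.random.default_rng(0)
x1 = rng.normal(size=(80,))  # e.g. one 80-bin Mel frame
x0 = rng.normal(size=(80,))  # Gaussian noise sample
x_t, u = fm_training_pair(x0, x1, 0.3)
# At inference, integrating dx/dt = v_theta(x, t, tokens) from t=0 to 1
# transports noise to a Mel spectrogram, playing the role of the
# non-autoregressive token-recovery model in LauraGPT.
```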