FunAudioLLM / CosyVoice

Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
https://funaudiollm.github.io/
Apache License 2.0
4.75k stars 480 forks

It is not the first attempt to involve supervised speech tokens into TTS models. #313

Open r666ay opened 3 weeks ago

r666ay commented 3 weeks ago

In the CosyVoice paper (https://arxiv.org/pdf/2407.05407), the authors state: "To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models." However, WhisperSpeech (https://github.com/collabora/WhisperSpeech/issues/3) already uses tokens obtained by vector quantization (VQ) of the Whisper encoder output, which is very similar to CosyVoice.

aluminumbox commented 2 weeks ago

Yes, but Whisper is more ASR-related and is not used in TTS scenarios.

r666ay commented 2 weeks ago

> Yes, but Whisper is more ASR-related and is not used in TTS scenarios.

WhisperSpeech (https://github.com/collabora/WhisperSpeech/issues/3) first quantizes Whisper encoder embeddings into semantic tokens, and then trains a text-to-semantic model and a semantic-to-acoustic model. The WhisperSpeech system uses these VQ-based semantic tokens as speech tokens for speech generation, which is similar to the approach in the CosyVoice paper (https://arxiv.org/pdf/2407.05407). So the claim ("To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models") may need to be corrected. Finally, thank you for your great work on CosyVoice.
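
For readers unfamiliar with the quantization step mentioned above, here is a rough sketch of how frame embeddings can be mapped to discrete tokens by nearest-neighbor lookup in a codebook. It is only an illustration, not the actual WhisperSpeech or CosyVoice code: the random vectors stand in for Whisper encoder embeddings and for a learned codebook, and the names `quantize`, `codebook`, and `embeddings` are hypothetical.

```python
import random


def quantize(embeddings, codebook):
    """Return, for each frame embedding, the index of the nearest codebook vector."""
    tokens = []
    for frame in embeddings:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        # The token id is the position of the closest entry.
        tokens.append(distances.index(min(distances)))
    return tokens


# Placeholder data (assumption): 16 codes and 10 frames, each of dimension 4.
rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]

tokens = quantize(embeddings, codebook)
print(tokens)
```

In a real system the codebook would be learned jointly with, or on top of, the speech encoder, and the resulting token sequence would be the target of a text-to-token model.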