Open r666ay opened 3 weeks ago
Yes, but Whisper is more ASR-related; it is not used in TTS scenarios.
WhisperSpeech (https://github.com/collabora/WhisperSpeech/issues/3) first quantizes Whisper encoder embeddings into semantic tokens, and then trains a text-to-semantic model and a semantic-to-acoustic model. The WhisperSpeech system uses these VQ-based semantic tokens as the speech tokens for speech generation, which is similar to the approach in the CosyVoice paper (https://arxiv.org/pdf/2407.05407). So the claim ("To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models") may need to be corrected. Finally, thanks for your great work on CosyVoice.
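For anyone following along, here is a minimal sketch of what "quantizing encoder embeddings into semantic tokens" means in general: each frame embedding is replaced by the index of its nearest codebook vector. This is only an illustration, not the actual WhisperSpeech or CosyVoice code. The `quantize` function, the toy 2-D codebook, and the frame values are all made up; real systems learn the codebook and work on high-dimensional encoder outputs.

```python
from typing import List, Sequence


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each frame embedding to the index of its nearest codebook vector.

    Hypothetical helper for illustration: nearest-neighbour lookup under
    squared Euclidean distance, which is the basic step in vector quantization.
    """
    tokens = []
    for frame in frames:
        # Squared distance from this frame to every codebook entry.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        # The token is the index of the closest entry.
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy data (assumed values, not taken from any real model).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.05, 0.0]]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 0]
```

The resulting integer sequence is what a text-to-semantic model would be trained to predict, and what a semantic-to-acoustic model would take as input.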
In the CosyVoice paper (https://arxiv.org/pdf/2407.05407), the authors state: "To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models." However, WhisperSpeech (https://github.com/collabora/WhisperSpeech/issues/3) has already used tokens obtained by applying VQ to the Whisper encoder embeddings, which is very similar to CosyVoice.