Open r666ay opened 2 months ago
Yes, but Whisper is more ASR-related and is not used in TTS scenarios.
WhisperSpeech (https://github.com/collabora/WhisperSpeech/issues/3) first quantizes Whisper encoder embeddings into semantic tokens, and then trains a text-to-semantic model and a semantic-to-acoustic model. The WhisperSpeech system uses these VQ-based semantic tokens as the speech tokens for speech generation, which is similar to the approach in the CosyVoice paper (https://arxiv.org/pdf/2407.05407). So the claim ("To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models") may need to be corrected. Finally, thanks for your great work on CosyVoice.
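To make the "quantize encoder embeddings into semantic tokens" step concrete, here is a minimal sketch of a nearest-codebook VQ lookup. This is only an illustration, not WhisperSpeech's or CosyVoice's actual code: the `quantize` helper, the toy 2-D codebook, and the hand-written frames are all assumptions standing in for real Whisper encoder outputs and a trained codebook.

```python
import numpy as np

def quantize(embeddings, codebook):
    """Map each frame embedding to the index of its nearest codebook vector.

    Hypothetical helper for illustration only.
    embeddings: (num_frames, dim), codebook: (num_codes, dim).
    """
    # Squared Euclidean distance between every frame and every code: (num_frames, num_codes)
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # The "semantic token" of a frame is the index of the closest code
    return dists.argmin(axis=1)

# Toy codebook with 4 codes in 2 dimensions (a real one would be learned).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Toy "encoder embeddings": three frames, each close to one code.
frames = np.array([[0.9, 0.1], [0.1, 0.8], [0.95, 1.05]])

tokens = quantize(frames, codebook)
print(tokens.tolist())  # [1, 2, 3]
```

In both systems as described above, a discrete token sequence of this kind is what the downstream text-to-token model is trained to predict.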
In the CosyVoice paper (https://arxiv.org/pdf/2407.05407), the authors state: "To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models." However, WhisperSpeech (https://github.com/collabora/WhisperSpeech/issues/3) already uses tokens obtained by applying VQ to Whisper encoder embeddings, which is very similar to CosyVoice.