sophiefy closed this 2 months ago
Yes, your proposed method will generate higher-quality speech instructions. In fact, we did exactly this in our AnyGPT paper, and the dataset is released at https://huggingface.co/datasets/fnlp/AnyInstruct/tree/main/speech_conv.
Thanks for your reply! I'll read the AnyGPT paper then.
Thank you for sharing this great work. The paper says that when building the Chain-of-Modality instruction dataset, you trained an extra text-to-unit model. If my understanding is right, it functions like a TTS model. Would it be possible to use an off-the-shelf TTS API to generate speech clips first and then use HuBERT to extract discrete units from them? I think this approach would be easier and more stable.
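To make the second half of that idea concrete, here is a rough sketch of the unit-extraction step only. It assumes the TTS audio has already been run through HuBERT to get a (frames x dim) feature matrix, and that a k-means codebook for those features is available. The function names, the toy data, and the choice to collapse repeated units are my own assumptions for illustration, not the pipeline from the paper.

```python
import numpy as np


def features_to_units(features, centroids):
    """Map each frame feature to the index of its nearest centroid.

    features:  (T, D) array, e.g. HuBERT hidden states for T frames (assumed input).
    centroids: (K, D) array, e.g. a k-means codebook fitted on such features (assumed input).
    Returns a (T,) integer array of unit ids.
    """
    features = np.asarray(features, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Squared Euclidean distance between every frame and every centroid: shape (T, K).
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def collapse_repeats(units):
    """Merge runs of identical consecutive units (optional post-processing, an assumption here)."""
    units = np.asarray(units)
    if units.size == 0:
        return units
    # Keep the first frame, then every frame that differs from its predecessor.
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]


if __name__ == "__main__":
    # Toy stand-ins for a real codebook and real HuBERT features.
    toy_centroids = [[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]]
    toy_features = [
        [0.1, -0.2], [0.3, 0.1],    # close to centroid 0
        [9.5, 10.2], [9.9, 9.7],    # close to centroid 1
        [-10.1, 10.3],              # close to centroid 2
        [0.0, 0.4],                 # back to centroid 0
    ]
    units = features_to_units(toy_features, toy_centroids)
    print(units.tolist())                    # [0, 0, 1, 1, 2, 0]
    print(collapse_repeats(units).tolist())  # [0, 1, 2, 0]
```

In a real run the toy arrays would be replaced by features from whichever HuBERT layer the codebook was trained on; using a different layer or a different codebook than the target model expects would give incompatible unit ids.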