Question About Chain-of-Modality Instruction Dataset

0nutation / SpeechGPT

SpeechGPT Series: Speech Large Language Models

https://0nutation.github.io/SpeechGPT.github.io/

Apache License 2.0


Question About Chain-of-Modality Instruction Dataset #29

Closed sophiefy closed 2 months ago

sophiefy commented 2 months ago

Thank you for sharing this great work. The paper says that when building the Chain-of-Modality instruction dataset, you trained an extra text-to-unit model. If I understand correctly, it functions like a TTS model. Could I instead use an off-the-shelf TTS API to generate speech clips first and then use HuBERT to extract discrete units? I think this would be easier and more stable.

0nutation commented 2 months ago

Yes, your proposed method will generate higher-quality speech instructions. We actually did this in our AnyGPT paper, and the dataset is released at https://huggingface.co/datasets/fnlp/AnyInstruct/tree/main/speech_conv.
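
For readers who want to try the TTS-then-HuBERT route, here is a minimal sketch of the last step only: turning frame-level features into discrete units. It assumes you already have HuBERT features for each synthesized clip and a set of k-means centroids. The toy 2-D vectors, the helper names (`nearest_centroid`, `frames_to_units`, `format_units`), the merging of consecutive repeated units, and the `<sosp>`/`<eosp>` wrapping are illustrative assumptions, not the confirmed SpeechGPT or AnyGPT pipeline.

```python
from itertools import groupby
from typing import List, Sequence

Vector = Sequence[float]


def nearest_centroid(frame: Vector, centroids: Sequence[Vector]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))

    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def frames_to_units(
    frames: Sequence[Vector],
    centroids: Sequence[Vector],
    dedupe: bool = True,
) -> List[int]:
    """Quantize each feature frame to a unit id.

    With dedupe=True, runs of identical consecutive units are merged into one
    (assumed convention for this sketch).
    """
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if dedupe:
        units = [unit for unit, _ in groupby(units)]
    return units


def format_units(units: Sequence[int]) -> str:
    """Render unit ids as a token string; the <sosp>/<eosp> markers are an assumption."""
    return "<sosp>" + "".join(f"<{u}>" for u in units) + "<eosp>"


# Toy stand-ins: real HuBERT features are high-dimensional and the codebook
# would come from k-means fitted on such features.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [3.8, 4.1], [4.2, 3.9]]

units = frames_to_units(frames, centroids)
print(format_units(units))
```

In a real setup, `frames` would be the per-frame outputs of a HuBERT layer for one TTS clip, and `centroids` would need to be the same codebook the speech LLM was trained with; otherwise the unit ids will not line up with the model's vocabulary.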

sophiefy commented 2 months ago

Thanks for your reply! I'll read the AnyGPT paper then.