indiejoseph closed this 7 hours ago
I have the same question. How many hours of training data does it use? What's the domain and language distribution of the training data? Is it English-only, or does it contain other languages as well? Is it all from podcasts and YouTube, or does it also include audiobooks / music?
In this post they claim it was trained on 20 million unique hours of high-quality audio data. It also seems to have been trained on multilingual data. If we can get more info, that would be great.
Source: This post
Yes, it was trained on 20M hours of general speech data, about 50% English. The most common non-English languages are Spanish and Chinese. The audio data is generally representative of public high-quality audio data.
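For anyone skimming the thread: the only concrete numbers stated so far are the 20M total hours and the ~50% English share, which implies roughly 10M hours of English. A minimal back-of-envelope sketch (the per-language shares for Spanish and Chinese were not disclosed, so they're left as commented-out placeholders):

```python
# Back-of-envelope breakdown implied by the reply above.
# Only the 20M total and the ~50% English share are stated;
# shares beyond English are unknown and left as placeholders.

TOTAL_HOURS = 20_000_000  # stated: 20M hours of general speech

shares = {
    "English": 0.50,   # stated: about 50% English
    # "Spanish": ?,    # stated as most common non-English, share undisclosed
    # "Chinese": ?,    # stated as next most common, share undisclosed
}

for lang, share in shares.items():
    print(f"{lang}: ~{int(TOTAL_HOURS * share):,} hours")
# English: ~10,000,000 hours
```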
Where can I find the details of the pretraining, e.g., the languages and hours in the dataset?