indiejoseph closed this 7 hours ago
I have the same question. How many hours of training data does it use? What's the domain and language distribution of the training data? Is it English-only, or does it contain other languages as well? Is it all from podcasts and YouTube, or does it also include audiobooks / music?
In this post they claim it was trained on 20 million unique hours of high-quality audio data. It also seems to have been trained on multilingual data. If we can get more info, that would be great.
Source: This post
Yes, it was trained on 20M hours of general speech data, about 50% English. The most common non-English languages are Spanish and Chinese. The audio data is generally representative of public high-quality audio data.
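For anyone skimming the thread: the only concrete numbers stated so far are the 20M total hours and the ~50% English share, which implies roughly 10M hours of English. A minimal back-of-envelope sketch (the per-language shares for Spanish and Chinese were not disclosed, so they're left as commented-out placeholders):

```python
# Back-of-envelope breakdown implied by the reply above.
# Only the 20M total and the ~50% English share are stated;
# shares beyond English are unknown and left as placeholders.

TOTAL_HOURS = 20_000_000  # stated: 20M hours of general speech

shares = {
    "English": 0.50,   # stated: about 50% English
    # "Spanish": ?,    # stated as most common non-English, share undisclosed
    # "Chinese": ?,    # stated as next most common, share undisclosed
}

for lang, share in shares.items():
    print(f"{lang}: ~{int(TOTAL_HOURS * share):,} hours")
# English: ~10,000,000 hours
```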
Where can I find the details of the pretraining, e.g., the languages and hours in the dataset?