6. Gather more multi-lingual data

collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.

https://collabora.github.io/WhisperSpeech/

MIT License

6. Gather more multi-lingual data #11

Open jpc opened 1 year ago

jpc commented 1 year ago

Right now we are using a subset of Libri-Light, a very large (60k hours) dataset of audiobooks read by thousands of speakers. It is pretty good, but there is a lot of (probably more expressive and emotional) speech available in YouTube videos. For the final training run it would be great to have more varied data to improve the quality of the model.

faceair commented 7 months ago

Approximately 10,000 hours of Chinese audio recordings are available here: https://github.com/wenet-e2e/WenetSpeech

jpc commented 7 months ago

I think we need native speakers to ensure high quality material and build the best global open source TTS system.

I am thinking of setting up a common format and some docs to help people prepare, validate and upload multilingual speech data to Hugging Face so it can be included in the WhisperSpeech base model training.

mush42 commented 7 months ago

Native Arabic speaker here. Just ping me once you're ready.

fakerybakery commented 6 months ago

Is this affiliated with Open Empathetic?