SpeechColab / GigaSpeech2

An evolving, large-scale and multi-domain ASR corpus for low-resource languages with automated crawling, transcription and refinement
Apache License 2.0

Dataset preparation in other languages #8

Closed. Tortoise17 closed this issue 3 weeks ago.

Tortoise17 commented 3 weeks ago

If I point the crawler at the paths of other channels, is it possible to generate my own dataset in another language?

yfyeung commented 3 weeks ago

@Tortoise17 Of course.

You can follow https://github.com/SpeechColab/GigaSpeech2/blob/main/pipeline/crawler/README.md.

Make sure you assign the right language ID. For example, the ISO 639-1 code for Thai is th, and the ISO 639-2 code is tha.
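For reference, here is an illustrative snippet with a few ISO 639-1 / ISO 639-2 pairs; which form the crawler expects should be confirmed against the README linked above:

```python
# Illustrative only: ISO 639-1 (two-letter) vs. ISO 639-2 (three-letter)
# codes for a few languages; verify which form the crawler expects.
LANG_CODES = {
    # language: (ISO 639-1, ISO 639-2)
    "Thai":       ("th", "tha"),
    "Indonesian": ("id", "ind"),
    "Vietnamese": ("vi", "vie"),
}
```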

Tortoise17 commented 3 weeks ago

Great, and the resulting format can then be used for training an AudioLDM2-like speech model, I guess?

Tortoise17 commented 3 weeks ago

Just one question: as a reference estimate, how long did it take to generate the prepared 30,000-hour dataset, and on which GPUs did you prepare it?

yfyeung commented 3 weeks ago

> Great, and the resulting format can then be used for training an AudioLDM2-like speech model, I guess?

@Tortoise17 Yes. The original file type is probably .webm, and it will be converted to .wav.
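If you need to run that conversion yourself, here is a minimal sketch using the ffmpeg CLI from Python; the 16 kHz mono target is an assumed ASR-friendly choice, not something confirmed in this thread:

```python
# A hedged sketch of the .webm -> .wav conversion using the ffmpeg CLI.
# 16 kHz mono is an assumed ASR-friendly target, not confirmed by the thread.
import subprocess
from pathlib import Path


def webm_to_wav(src: str, sample_rate: int = 16000) -> str:
    dst = str(Path(src).with_suffix(".wav"))
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,            # -y: overwrite output
         "-ar", str(sample_rate), "-ac", "1",  # resample, downmix to mono
         dst],
        check=True,  # raise if ffmpeg exits with an error
    )
    return dst
```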

yfyeung commented 3 weeks ago

> Just one question: as a reference estimate, how long did it take to generate the prepared 30,000-hour dataset, and on which GPUs did you prepare it?

@Tortoise17 You can refer to the screenshot below:

[Screenshot 2024-09-08 18:42:30: reference figures for preparation time and hardware]

We suggest using faster-whisper with multiple GPUs to parallelize the transcription.
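A minimal sketch of that setup, assuming one worker process per GPU, a "large-v2" model, Thai audio, and a flat directory of .wav files; none of these specifics come from the thread:

```python
# A minimal multi-GPU transcription sketch with faster-whisper.
# Assumptions (not from this thread): "large-v2" model, Thai audio,
# a directory of .wav files, and one worker process per GPU.
import multiprocessing as mp
from pathlib import Path

from faster_whisper import WhisperModel


def transcribe_shard(gpu_id: int, wav_paths: list, language: str = "th") -> None:
    # Pin this worker to a single GPU via device_index.
    model = WhisperModel("large-v2", device="cuda",
                         device_index=gpu_id, compute_type="float16")
    for path in wav_paths:
        segments, _info = model.transcribe(path, language=language)
        out = Path(path).with_suffix(".txt")
        with open(out, "w", encoding="utf-8") as f:
            for seg in segments:  # segments is a lazy generator
                f.write(f"{seg.start:.2f}\t{seg.end:.2f}\t{seg.text.strip()}\n")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # safer with CUDA than the default fork
    num_gpus = 4  # adjust to your machine
    wavs = sorted(str(p) for p in Path("audio").glob("*.wav"))
    shards = [wavs[i::num_gpus] for i in range(num_gpus)]
    workers = [mp.Process(target=transcribe_shard, args=(i, shard))
               for i, shard in enumerate(shards)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Sharding by round-robin keeps the per-GPU workloads roughly balanced; for long-tailed file durations you may want to sort by length first.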