SpeechColab / GigaSpeech

Large, modern dataset for speech recognition
Apache License 2.0

Clean the original dataset collected from different sources (YouTube, Podcast, and Audiobook) #132

Open kerolos opened 1 year ago

kerolos commented 1 year ago

I would like to ask, if possible, what procedures were used to filter the original dataset, for example the data collected from YouTube. Is there any script you would recommend for filtering and cleanup?

I have used the Kaldi cleanup scripts under /egs/wsj/s5/steps/cleanup/:

A) GMM-based (clean_and_segment_data.sh, find_bad_utts.sh): "It did not work perfectly for me, especially when there are systematic errors in the dataset."
B) NNET-based (clean_and_segment_data_nnet3.sh, find_bad_utts_nnet3.sh): "It depends on the pretrained model, which was not good in my case."
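For context, a typical invocation of these stock Kaldi scripts looks like the sketch below; all paths, job counts, and directory names are illustrative assumptions, not the settings used for GigaSpeech:

```bash
# Sketch: GMM-based cleanup of an existing Kaldi data directory.
# Assumes data/train, data/lang, and a trained GMM system in
# exp/tri3 already exist; every path and option here is illustrative.
steps/cleanup/clean_and_segment_data.sh --nj 40 --cmd "run.pl" \
  data/train data/lang exp/tri3 \
  exp/tri3_cleanup data/train_cleaned

# Rank utterances by how badly they decode/align, for manual
# inspection or hand-tuned filtering afterwards.
steps/cleanup/find_bad_utts.sh --nj 40 --cmd "run.pl" \
  data/train data/lang exp/tri3 exp/tri3_bad_utts
```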

You mentioned the steps in Section 3 of the paper (GigaSpeech creation pipeline), parts 3.2, 3.3, and 3.4. But I would like to know whether you used scripts different from Kaldi's, or what modifications were made to the original Kaldi "cleanup" scripts?

Thanks in advance, I really appreciate any support.

dophist commented 1 year ago

The pipeline was developed based on the existing Kaldi scripts you mentioned above, but with a lot of bug fixes and ad-hoc modifications. However, we have no near-term plan to open-source these tools, because it would require non-trivial effort to clean up and generalize the code.
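Although the modified tools are not public, the stock Kaldi scripts already support a basic filter-by-WER workflow. A minimal sketch, assuming find_bad_utts.sh has produced a per-utterance WER listing with lines like "<utt-id> <wer>"; the file name utt_wer is hypothetical (the actual output name depends on the Kaldi version), and the 20% threshold is an arbitrary example:

```bash
# Sketch: keep only utterances whose WER against the reference
# transcript is below a threshold, then subset the data directory.
# utt_wer is a hypothetical file of "<utt-id> <wer>" lines.
awk '$2 < 20 {print $1}' exp/tri3_bad_utts/utt_wer > keep_utts.txt

# utils/subset_data_dir.sh is a standard Kaldi utility that copies
# a data directory restricted to the given utterance list.
utils/subset_data_dir.sh --utt-list keep_utts.txt \
  data/train data/train_filtered
```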