SpeechColab / GigaSpeech

Large, modern dataset for speech recognition
Apache License 2.0

Clean the original dataset collected from different sources (YouTube, Podcast, and Audiobook) #132

Open kerolos opened 1 year ago

kerolos commented 1 year ago

I would like to ask, if possible, what procedures were used to filter the original dataset, for example the data collected from YouTube. Is there any script you would recommend for filtering and cleanup?

I have used the Kaldi cleanup scripts under /egs/wsj/s5/steps/cleanup/:

A) GMM-based (clean_and_segment_data.sh, find_bad_utts.sh): "It did not work perfectly for me, especially when there are systematic errors in the dataset."
B) NNET-based (clean_and_segment_data_nnet3.sh, find_bad_utts_nnet3.sh): "It depends on the pretrained model, which was not good in my case."
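For context, a typical invocation of these stock Kaldi scripts looks like the sketch below; all paths, job counts, and directory names are illustrative assumptions, not the settings used for GigaSpeech:

```bash
# Sketch: GMM-based cleanup of an existing Kaldi data directory.
# Assumes data/train, data/lang, and a trained GMM system in
# exp/tri3 already exist; every path and option here is illustrative.
steps/cleanup/clean_and_segment_data.sh --nj 40 --cmd "run.pl" \
  data/train data/lang exp/tri3 \
  exp/tri3_cleanup data/train_cleaned

# Rank utterances by how badly they decode/align, for manual
# inspection or hand-tuned filtering afterwards.
steps/cleanup/find_bad_utts.sh --nj 40 --cmd "run.pl" \
  data/train data/lang exp/tri3 exp/tri3_bad_utts
```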

You mentioned the steps in Section 3 of the paper (GigaSpeech creation pipeline), parts 3.2, 3.3, and 3.4. But I would like to know whether you used scripts different from Kaldi's, or what modifications were made to the original Kaldi "cleanup" scripts?

Thanks in advance, I really appreciate any support.

dophist commented 1 year ago

The pipeline was developed based on the existing Kaldi scripts you mentioned above, but with a lot of bug fixes and ad-hoc modifications. However, we have no near-term plan to open-source these tools, because it would require non-trivial effort to clean up and generalize the code.
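Although the modified tools are not public, the stock Kaldi scripts already support a basic filter-by-WER workflow. A minimal sketch, assuming find_bad_utts.sh has produced a per-utterance WER listing with lines like "<utt-id> <wer>"; the file name utt_wer is hypothetical (the actual output name depends on the Kaldi version), and the 20% threshold is an arbitrary example:

```bash
# Sketch: keep only utterances whose WER against the reference
# transcript is below a threshold, then subset the data directory.
# utt_wer is a hypothetical file of "<utt-id> <wer>" lines.
awk '$2 < 20 {print $1}' exp/tri3_bad_utts/utt_wer > keep_utts.txt

# utils/subset_data_dir.sh is a standard Kaldi utility that copies
# a data directory restricted to the given utterance list.
utils/subset_data_dir.sh --utt-list keep_utts.txt \
  data/train data/train_filtered
```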