Open kerolos opened 1 year ago
The pipeline was developed based on existing Kaldi scripts as you mentioned above, but with a lot of bug fixes and ad-hoc modifications. However we have no near plan to open source these tools, coz it may require non-trivial efforts to clean up & generalize the code.
If possible, I would like to know what procedures were used to filter the original dataset, for example the audio collected from YouTube. Is there any script you would recommend for filtering and cleanup?
I have tried the Kaldi cleanup scripts in /egs/wsj/s5/steps/cleanup/: A) GMM (clean_and_segment_data.sh, find_bad_utts.sh). This did not work well for me, especially when there are systematic errors in the dataset. B) NNET (clean_and_segment_data_nnet3.sh, find_bad_utts_nnet3.sh). This depends on the pretrained model, which is not good in my case.
Section 3 of the paper (GigaSpeech creation pipeline), parts 3.2, 3.3, and 3.4, describes the steps involved. But I would like to know whether you used scripts other than Kaldi's, or what was modified in the original Kaldi "cleanup" scripts.
Thanks in advance, I really appreciate any support.