SpeechColab / GigaSpeech

Large, modern dataset for speech recognition
Apache License 2.0

Inconsistency in local/gigaspeech_data_prep.sh #119

Closed: leohuang2013 closed this issue 2 years ago

leohuang2013 commented 2 years ago

local/gigaspeech_data_prep.sh calls utils/gigaspeech_download.sh, which does not exist in the GigaSpeech repo. The call should be utils/download_gigaspeech.sh $gigaspeech

    if [ $stage -le 0 ]; then
      echo "======GigaSpeech Download START | current time : `date +%Y-%m-%d-%T`==="
      pushd $gigaspeech_repo
      utils/gigaspeech_download.sh $gigaspeech_root || exit 1

And in run.sh:

    if [ $stage -le 1 ]; then
      echo "======Prepare Dictionary START | current time : `date +%Y-%m-%d-%T`===="
      [ ! -f $g2p_model ] && echo "$0: Cannot find G2P model $g2p_model" && exit 1
      local/prepare_dict.sh \
        --cmd "$train_cmd" --nj $train_nj \
        $g2p_model data/$train_combined $dict_dir || exit 1;
      echo "======Prepare Dictionary END | current time : `date +%Y-%m-%d-%T`======"
    fi

This stage checks for the G2P model, which is supposed to be downloaded by utils/download_gigaspeech.sh when the --with-dict flag is set to 'true'. The flag defaults to 'false', so the G2P model is not downloaded. To solve this, '--with-dict true' needs to be passed when invoking utils/download_gigaspeech.sh.

The final modification for downloading GigaSpeech would be:

    if [ $stage -le 0 ]; then
      echo "======GigaSpeech Download START | current time : `date +%Y-%m-%d-%T`==="
      pushd $gigaspeech_repo
      utils/download_gigaspeech.sh --with-dict true $gigaspeech_root || exit 1

Is the above right, or did I miss something?

dophist commented 2 years ago

Hi Liyi, yes, it should be. I haven't checked the Kaldi recipe for quite a while, but here are some facts, as far as I recall, that might be helpful to you:

  1. utils/download_gigaspeech.sh was once named utils/gigaspeech_download.sh; they are the same script. Kaldi's recipe might be out of sync with this renaming.

  2. If the dictionary is indeed needed (the typical case in Kaldi's hybrid systems), you can pass the --with-dict option to get it.
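
As an illustration of the two points above, here is a minimal sketch of how a recipe could tolerate either script name. The helper find_download_script is hypothetical; it is not part of the Kaldi recipe or the GigaSpeech repo. Whether the script under its older name accepted --with-dict is also not confirmed in this thread, so treat the usage in the comments as an assumption.

```shell
# Hypothetical helper (an assumption for illustration, not existing recipe code):
# print the download script's path relative to the repo root, preferring the
# current name and falling back to the older one.
find_download_script() {
  local repo="$1"
  local name
  for name in download_gigaspeech.sh gigaspeech_download.sh; do
    if [ -f "$repo/utils/$name" ]; then
      echo "utils/$name"
      return 0
    fi
  done
  echo "no download script found under $repo/utils" >&2
  return 1
}

# Possible use in stage 0 of local/gigaspeech_data_prep.sh (sketch only):
#   script=$(find_download_script "$gigaspeech_repo") || exit 1
#   pushd $gigaspeech_repo
#   $script --with-dict true $gigaspeech_root || exit 1
```

Simply renaming the call in the recipe, as proposed above, is the smaller change; a lookup like this would only matter if the recipe had to work against both old and new checkouts of the GigaSpeech repo.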

leohuang2013 commented 2 years ago

Thanks, Jiayu, for your quick reply. Should we take action on this issue, such as creating a pull request or something else, to make it easier for others to try GigaSpeech?

dophist commented 2 years ago

Yes, I believe creating a PR to fix the name inconsistency in Kaldi's GigaSpeech recipe will definitely help other users and save them time. That would be great!