Hi there.

I've been training the standard recipe for the mini_librispeech dataset using the tuning/run_tdnn_1k.sh script (which local/chain/run_tdnn.sh points to). I was using a 32-core cluster running Ubuntu 18.04, since I didn't have a GPU available. The only major modification I made to the scripts was to comment out the background decoding jobs for the intermediate models (like the monophone one) in run.sh, so the only decoding step left was for tri3b (SAT). The rest of the script was executed as is.
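Concretely, the change amounted to commenting out the parenthesized background decoding blocks in run.sh, e.g. for the monophone stage (a rough sketch of the stock mini_librispeech run.sh; the exact paths and options in your checkout may differ):

# (
#   utils/mkgraph.sh data/lang_nosp_test exp/mono exp/mono/graph_nosp
#   steps/decode.sh --nj 10 --cmd "$decode_cmd" exp/mono/graph_nosp \
#     data/dev_clean_2 exp/mono/decode_nosp_dev_clean_2
# )&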
The problem is that when I try to decode the example file dr_strangelove.mp3, I get only short, seemingly random transcriptions that don't even reflect the length of the audio, as you can see in the screenshot below.
My model and ivector files are linked to the recipe's exp/ folder as follows:
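Roughly, the layout is along the following lines (illustrative only: the directory names below assume the standard chain-recipe output of run_tdnn_1k.sh, so treat them as placeholders rather than my exact paths):

$ ls -l models/
final.mdl -> .../exp/chain/tdnn1k_sp/final.mdl
HCLG.fst -> .../exp/chain/tdnn1k_sp/graph_tgsmall/HCLG.fst
words.txt -> .../exp/chain/tdnn1k_sp/graph_tgsmall/words.txt
ivector_extractor -> .../exp/nnet3/extractor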
Configuration files were kept similar to those downloaded by the script prepare-models.sh except for the MFCC config file, which was modified to match the mfcc_hires.conf used during training.
$ cat conf/mfcc.conf
--use-energy=false # only non-default option.
--sample-frequency=16000 # Switchboard is sampled at 8kHz # changed for mini librispeech
# config for high-resolution MFCC features, intended for neural network
# training
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--num-mel-bins=40 # similar to Google's setup.
--num-ceps=40 # there is no dimensionality reduction.
--low-freq=20 # low cutoff frequency for mel bins... this is high-bandwidth data, so
# there might be some information at the low end.
--high-freq=-400 # high cutoff frequency, relative to Nyquist of 8000 (=7600)
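As a quick sanity check that the decode-time features really match training, the file can be diffed against the copy in the recipe (the path below assumes the standard egs layout); only the two leading options carried over from the original mfcc.conf, plus comment and ordering noise, should show up:

$ diff conf/mfcc.conf $KALDI_ROOT/egs/mini_librispeech/s5/conf/mfcc_hires.conf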
Regarding the command-line options for kaldinnet2onlinedecoder in transcribe-audio.sh, I just switched nnet-mode to 3 to enable nnet3 support, and set use-threaded-decoder=false (https://github.com/alumae/gst-kaldi-nnet2-online/issues/45).
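For reference, the resulting invocation looks roughly like this (a sketch based on the plugin's transcribe-audio.sh demo; the model, fst, word-syms, and config paths are placeholders, and only nnet-mode and use-threaded-decoder were changed from the demo defaults):

GST_PLUGIN_PATH=. gst-launch-1.0 --quiet \
  filesrc location=dr_strangelove.mp3 ! decodebin ! audioconvert ! audioresample ! \
  kaldinnet2onlinedecoder \
    nnet-mode=3 \
    use-threaded-decoder=false \
    model=models/final.mdl \
    fst=models/HCLG.fst \
    word-syms=models/words.txt \
    feature-type=mfcc \
    mfcc-config=conf/mfcc.conf \
    ivector-extraction-config=conf/ivector_extractor.conf \
  ! filesink location=transcription.txt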
Any ideas on what I'm possibly missing here?
Possibly related: https://github.com/alumae/gst-kaldi-nnet2-online/issues/83