alumae / gst-kaldi-nnet2-online

GStreamer plugin around Kaldi's online neural network decoder
Apache License 2.0

Wrong transcription on mini_librispeech nnet3 trained model #96

Open cassiotbatista opened 4 years ago

cassiotbatista commented 4 years ago

Hi there.

I've been training the standard recipe for the mini_librispeech dataset using the tuning/run_tdnn_1k.sh script (which local/chain/run_tdnn.sh is symlinked to). I used a 32-core cluster running Ubuntu 18.04, since I didn't have a GPU available. The only major modification I made to the scripts was to comment out the background decoding processes for the intermediate models (such as the monophone one) in run.sh, so the only decoding step left was for tri3b (SAT). The rest of the script was executed as is.
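For the record, the change amounts to commenting out the decode subshells for the early models; paraphrased from the standard run.sh (exact paths and stage layout may differ slightly in your checkout), it looks roughly like this:

# run.sh: background decoding for intermediate models commented out,
# e.g. for the monophone system (the tri3b/SAT decode was kept):
#(
#  utils/mkgraph.sh data/lang_nosp_test_tgsmall \
#    exp/mono exp/mono/graph_nosp_tgsmall
#  steps/decode.sh --nj 10 --cmd "$decode_cmd" exp/mono/graph_nosp_tgsmall \
#    data/dev_clean_2 exp/mono/decode_nosp_tgsmall_dev_clean_2
#)&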

The problem is that when I try to decode the example file dr_strangelove.mp3, I get only short, seemingly random transcriptions that don't even reflect the length of the audio, as you can see in the screenshot below.

[screenshot: 2020-04-15-152937_1096x336_scrot, showing the short, incorrect transcriptions]

My model and ivector files are linked to the recipe's exp/ folder as follows:

$ tree models-chaina/ ivector_extractor-chaina/
models-chaina/
├── final.mdl         -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tdnn1k_sp/final.mdl
├── HCLG.fst          -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tree_sp/graph_tgsmall/HCLG.fst
├── phones.txt        -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tdnn1k_sp/phones.txt
├── word_boundary.int -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tree_sp/graph_tgsmall/phones/word_boundary.int
└── words.txt         -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tree_sp/graph_tgsmall/words.txt
ivector_extractor-chaina/
├── final.dubm        -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/nnet3_online_cmn/extractor/final.dubm
├── final.ie          -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/nnet3_online_cmn/extractor/final.ie
├── final.mat         -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/nnet3_online_cmn/extractor/final.mat
└── global_cmvn.stats -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/nnet3_online_cmn/extractor/global_cmvn.stats
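
For completeness, each link was created with a plain ln -s against the recipe's exp/ folder, e.g.:

$ ln -s /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tdnn1k_sp/final.mdl \
    models-chaina/final.mdl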

Configuration files were kept the same as those downloaded by the prepare-models.sh script, except for the MFCC config file, which was modified to match the mfcc_hires.conf used during training.

$ cat conf/mfcc.conf
--use-energy=false   # only non-default option.
--sample-frequency=16000  # Switchboard is sampled at 8kHz; changed for mini_librispeech

# config for high-resolution MFCC features, intended for neural network
# training
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--num-mel-bins=40     # similar to Google's setup.
--num-ceps=40     # there is no dimensionality reduction.
--low-freq=20     # low cutoff frequency for mel bins... this is high-bandwidth data, so
                  # there might be some information at the low end.
--high-freq=-400 # high cutoff frequency, relative to Nyquist of 8000 (=7600)
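
To make sure nothing else diverges from the features used in training, the file can be diffed against the recipe's hires config (assuming the training config still sits at the recipe's default conf/mfcc_hires.conf):

$ diff conf/mfcc.conf \
    /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/conf/mfcc_hires.conf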

Regarding the command-line options for kaldinnet2onlinedecoder in transcribe-audio.sh, I just switched nnet-mode to 3 to enable nnet3 support and set use-threaded-decoder=false (https://github.com/alumae/gst-kaldi-nnet2-online/issues/45).
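The resulting invocation is essentially the stock transcribe-audio.sh pipeline; a minimal sketch with my settings (property names and the ivector config file name are from my memory of the plugin and the script, so treat this as illustrative rather than exact):

$ gst-launch-1.0 filesrc location=dr_strangelove.mp3 ! decodebin ! \
    audioconvert ! audioresample ! \
    kaldinnet2onlinedecoder \
      use-threaded-decoder=false nnet-mode=3 \
      model=models-chaina/final.mdl \
      fst=models-chaina/HCLG.fst \
      word-syms=models-chaina/words.txt \
      phone-syms=models-chaina/phones.txt \
      word-boundary-file=models-chaina/word_boundary.int \
      mfcc-config=conf/mfcc.conf \
      ivector-extraction-config=conf/ivector_extractor.conf ! \
    filesink location=transcript.txt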

Any ideas on what I might be missing here?

Possibly related: https://github.com/alumae/gst-kaldi-nnet2-online/issues/83