Closed mzarazov closed 7 years ago
Can you paste your config YAML file?
Yeah
use-nnet2: True
decoder:
    nnet-mode: 3
    nnet-batch-size: 128
    model : mdl_lambda19/final.mdl
    word-syms : mdl_lambda19/words.txt
    phone-syms : mdl_lambda19/phones.txt
    fst : mdl_lambda19/HCLG.fst
    mfcc-config : mdl_lambda19/conf/mfcc.conf
    ivector-extraction-config : mdl_lambda19/conf/ivector_extractor.conf
    ivector-silence-weighting-silence-weight: 1.0
    max-active: 10000
    beam: 13.0
    lattice-beam: 6.0
    acoustic-scale: 1.0
    do-endpointing : true
    endpoint-silence-phones : "1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20"
    max-state-duration: 40
    frame-subsampling-factor: 3
    num-nbest: 10
out-dir: tmp
use-vad: False
silence-timeout: 10
post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'
full-post-processor: ./abbrev_punct_post_processor.py
logging:
    version : 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: ERROR
    root:
        level: ERROR
        handlers: [console]
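A quick way to sanity-check a worker config like the one above before starting the worker is to verify the decoder section has the keys the decoder needs. This is a sketch, not part of kaldi-gstreamer-server; the dict below mirrors a subset of the pasted YAML, and in practice you would obtain it with `yaml.safe_load()`:

```python
# Sketch: sanity-check the decoder section of a parsed worker config.
# The dict mirrors a subset of the YAML config pasted above; in practice
# you would load it from the file with yaml.safe_load().

REQUIRED_DECODER_KEYS = ("model", "fst", "word-syms", "mfcc-config")

def missing_decoder_keys(cfg):
    """Return the required decoder keys absent from cfg."""
    decoder = cfg.get("decoder", {})
    return [k for k in REQUIRED_DECODER_KEYS if k not in decoder]

cfg = {
    "use-nnet2": True,
    "decoder": {
        "model": "mdl_lambda19/final.mdl",
        "fst": "mdl_lambda19/HCLG.fst",
        "word-syms": "mdl_lambda19/words.txt",
        "mfcc-config": "mdl_lambda19/conf/mfcc.conf",
    },
}
print(missing_decoder_keys(cfg))  # → []
```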
And here's an example script I'm using to transcribe audio with the raw gst-kaldi-nnet2-online GStreamer plugin:
#!/bin/bash
if [ $# != 1 ]; then
    echo "Usage: transcribe-audio.sh <audio>"
    echo "e.g.: transcribe-audio.sh dr_strangelove.mp3"
    exit 1;
fi
! GST_PLUGIN_PATH=../src gst-inspect-1.0 kaldinnet2onlinedecoder > /dev/null 2>&1 && echo "Compile the plugin in ../src first" && exit 1;
if [ ! -f HCLG.fst ]; then
    echo "Missing decoding graph"
    exit 1;
fi
audio=$1
GST_DEBUG="kaldinnet2onlinedecoder:4" GST_PLUGIN_PATH=../src gst-launch-1.0 -q filesrc location="$audio" ! decodebin ! audioconvert ! audioresample ! \
kaldinnet2onlinedecoder \
use-threaded-decoder=false \
nnet-mode=3 \
model=final.mdl \
fst=HCLG.fst \
word-syms=words.txt \
phone-syms=phones.txt \
feature-type=mfcc \
mfcc-config=conf/mfcc.conf \
acoustic-scale=1.0 \
ivector-extraction-config=conf/ivector_extractor.conf \
frame-subsampling-factor=3 \
nnet-batch-size=128 \
max-active=10000 \
beam=13.0 \
lattice-beam=6.0 \
do-endpointing=true \
endpoint-silence-phones="1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20" \
ivector-silence-weighting-silence-weight=1.0 \
max-state-duration=40 \
! filesink location=/dev/stdout buffer-mode=2
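Incidentally, the `endpoint-silence-phones` value above is just colon-joined integer phone IDs. If you ever need to rebuild it for a model with a different phone inventory, something like this works (a sketch; the IDs 1–20 match the command above, but for another model you would take them from its phone symbol table):

```python
# Sketch: building the colon-separated endpoint-silence-phones string
# from a list of silence phone IDs (1..20 here, matching the command above).
silence_ids = range(1, 21)
endpoint_silence_phones = ":".join(str(i) for i in silence_ids)
print(endpoint_silence_phones)  # → 1:2:3:...:19:20
```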
Does this happen consistently across many runs with the same file and same settings?
Yeah, I ran all segments in eval2000 and it consistently truncates the last 1–2 words, increasing the number of deletions relative to the reference and thus raising WER from 12.3% to 24.6%.
To be more specific, the number of sentences with at least one deletion went up from 29% to 90%.
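To illustrate why dropped trailing words show up specifically as deletions: under the standard minimum-edit-distance alignment used for WER scoring, a hypothesis that is a clean prefix of the reference aligns with only deletion errors at the end. A small sketch (plain Levenshtein over word lists, not Kaldi's scorer, and the sentences are made up):

```python
# Sketch: truncated trailing words surface purely as deletions in WER scoring.
# Plain word-level Levenshtein alignment (not Kaldi's compute-wer).

def wer_counts(ref, hyp):
    """Minimum-edit alignment; returns (substitutions, deletions, insertions)."""
    # Each DP cell is (total_cost, subs, dels, ins); min() picks lowest cost.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # row 0: j insertions
    for i, r in enumerate(ref, 1):
        cur = [(i, 0, i, 0)]                            # col 0: i deletions
        for j, h in enumerate(hyp, 1):
            if r == h:
                cur.append(prev[j - 1])                 # match: no cost
            else:
                pd, pj, cl = prev[j - 1], prev[j], cur[j - 1]
                cur.append(min(
                    (pd[0] + 1, pd[1] + 1, pd[2], pd[3]),  # substitution
                    (pj[0] + 1, pj[1], pj[2] + 1, pj[3]),  # deletion
                    (cl[0] + 1, cl[1], cl[2], cl[3] + 1),  # insertion
                ))
        prev = cur
    return prev[-1][1:]

ref = "so we went to the store yesterday".split()
hyp = "so we went to the".split()          # last two words truncated
s, d, i = wer_counts(ref, hyp)
print((s, d, i))                           # → (0, 2, 0): only deletions
print((s + d + i) / len(ref))              # WER = 2/7 ≈ 0.29
```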
Are you sure you are using the last "final" result from the kaldi-gstreamer-server? I suspect the last word of the utterance is often included only in the final result, not in the intermediate results. It's very likely that you are doing the right thing, I just want to be sure.
Let me clarify my question: how do you get the result from kaldi-gstreamer-server? Using client.py?
Great pointer. While I am using the final result, I'm pretty sure something in my modified gst-kaldi-nnet2-online code is messing it up and eating part of the JSON. Weirdly, it displays the full thing in the transcribe.sh script output, so this must be my doing 🥇 I'll close this issue for now, since it's not a problem in kaldi-gstreamer-server. Thank you for your help!
P.S. The problem was as follows: I was trying to extract word timing information, and a bug crept in. Since there is an <eps> label at the beginning of the word sequence, the last word was trimmed from the word hypotheses. Silly stuff.
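For anyone hitting something similar, the bug pattern was roughly this (an illustrative sketch, not the actual plugin code): the word sequence read from the lattice begins with an `<eps>` label, and trimming one element from the wrong end silently drops the last real word instead of the epsilon:

```python
# Illustrative sketch of the off-by-one described above (not the real plugin code).
words = ["<eps>", "so", "we", "went", "to", "the", "store"]

# Buggy: meant to drop the leading <eps>, but drops the LAST word instead.
buggy = words[:-1]

# Fixed: strip <eps> labels explicitly, keeping every real word.
fixed = [w for w in words if w != "<eps>"]
print(fixed[-1])  # → store
```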
My workers tend to miss the last word of an audio file when streaming it via client.py.
For example, when I take the following segment from eval2000 using gst-kaldi-nnet2-online, the hypothesis is …
However, using kaldi-gstreamer with exactly the same settings, the hypothesis is …
Is this expected? Or am I doing something wrong?
Thanks,