alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License

Truncated transcription #76

Closed mzarazov closed 7 years ago

mzarazov commented 7 years ago

My workers tend to miss the last word of an audio file when streaming it to the server via client.py.

For example, when I take the following segment from eval2000

$KALDI_ROOT/egs/fisher_swbd/s5/../../..//tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 2 $HUB5E/english/sw_4910.sph | sox -t wav - sw_4910-B_trimmed.wav trim '295.81' '=299.01'

using gst-kaldi-nnet2-online directly, the hypothesis is

i i i'm old enough i'm over fifty

However, using kaldi-gstreamer-server with exactly the same settings, the hypothesis is

i i i'm old enough i'm over

Is this expected? Or am I doing something wrong?
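(For reference, here is a rough sketch of the kind of client.py invocation used to stream such a file to the server; the server URL and the -r byte rate are assumptions based on the default setup and 8 kHz 16-bit mono audio, so adjust them to your configuration.)

# Hypothetical client.py call; -r is the byte rate used to simulate real-time streaming
# (8000 samples/s * 2 bytes/sample = 16000 for 8 kHz 16-bit mono audio).
python kaldigstserver/client.py -r 16000 \
  -u ws://localhost:8888/client/ws/speech \
  sw_4910-B_trimmed.wav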

Thanks,

alumae commented 7 years ago

Can you paste your config yaml file?

mzarazov commented 7 years ago

Yeah

use-nnet2: True
decoder:
    nnet-mode: 3
    nnet-batch-size: 128
    model : mdl_lambda19/final.mdl
    word-syms : mdl_lambda19/words.txt
    phone-syms : mdl_lambda19/phones.txt
    fst : mdl_lambda19/HCLG.fst
    mfcc-config : mdl_lambda19/conf/mfcc.conf
    ivector-extraction-config : mdl_lambda19/conf/ivector_extractor.conf
    ivector-silence-weighting-silence-weight: 1.0
    max-active: 10000
    beam: 13.0
    lattice-beam: 6.0
    acoustic-scale: 1.0
    do-endpointing : true
    endpoint-silence-phones : "1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20"
    max-state-duration: 40
    frame-subsampling-factor: 3
    num-nbest: 10
out-dir: tmp

use-vad: False
silence-timeout: 10

post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'
full-post-processor: ./abbrev_punct_post_processor.py

logging:
    version : 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: ERROR
    root:
        level: ERROR
        handlers: [console]
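(As a side note on how a config like this is typically consumed: the master server is started on a port and a worker is pointed at it with this yaml. The commands below are only a sketch of the usual invocation; the port, URL and config filename nnet2.yaml are assumptions to adapt to your setup.)

# In one terminal: start the master server (assumed default port).
python kaldigstserver/master_server.py --port=8888
# In another terminal: start a worker with the config above (hypothetical filename).
python kaldigstserver/worker.py -u ws://localhost:8888/worker/ws/speech -c nnet2.yaml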

mzarazov commented 7 years ago

And here's an example script I'm using to transcribe audio with the raw gst-kaldi-nnet2-online GStreamer plugin:

#!/bin/bash

if [ $# != 1 ]; then
    echo "Usage: transcribe-audio.sh <audio>"
    echo "e.g.: transcribe-audio.sh dr_strangelove.mp3"
    exit 1;
fi

! GST_PLUGIN_PATH=../src gst-inspect-1.0 kaldinnet2onlinedecoder > /dev/null 2>&1 && echo "Compile the plugin in ../src first" && exit 1;

if [ ! -f HCLG.fst ]; then
    echo "Missing decoding graph"
    exit 1;
fi

audio=$1

GST_DEBUG="kaldinnet2onlinedecoder:4" GST_PLUGIN_PATH=../src gst-launch-1.0 -q filesrc location="$audio" ! decodebin ! audioconvert ! audioresample ! \
  kaldinnet2onlinedecoder \
  use-threaded-decoder=false \
  nnet-mode=3 \
  model=final.mdl \
  fst=HCLG.fst \
  word-syms=words.txt \
  phone-syms=phones.txt \
  feature-type=mfcc \
  mfcc-config=conf/mfcc.conf \
  acoustic-scale=1.0 \
  ivector-extraction-config=conf/ivector_extractor.conf \
  frame-subsampling-factor=3 \
  nnet-batch-size=128 \
  max-active=10000 \
  beam=13.0 \
  lattice-beam=6.0 \
  do-endpointing=true \
  endpoint-silence-phones="1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20" \
  ivector-silence-weighting-silence-weight=1.0 \
  max-state-duration=40 \
! filesink location=/dev/stdout buffer-mode=2
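For completeness, a typical run of this script on the trimmed segment from above would look roughly like this (assuming the script sits in the gst-kaldi-nnet2-online demo directory next to HCLG.fst and the model files):

# Hypothetical invocation; the decoded hypothesis is written to stdout.
./transcribe-audio.sh sw_4910-B_trimmed.wav
# -> i i i'm old enough i'm over fifty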

alumae commented 7 years ago

Does this happen consistently across many runs with the same file and same settings?

mzarazov commented 7 years ago

Yeah, I ran all segments in eval2000 and it consistently truncates the last 1-2 words, increasing the number of deletions compared to the reference and raising the WER from 12.3% to 24.6%.

mzarazov commented 7 years ago

To be more specific, the number of sentences with at least one deletion went up from 29% to 90%.

alumae commented 7 years ago

Are you sure you are using the last "final" result from kaldi-gstreamer-server? I suspect the last word of the utterance is often included only in the final result, not in the intermediate results. It's very likely that you are doing the right thing; I just want to be sure.
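Roughly, the messages for a segment look something like this: a series of intermediate hypotheses followed by one result marked as final, and only the final one is expected to contain the complete segment. The JSON below is illustrative only (transcripts taken from your example; check the actual server output for the exact fields):

{"status": 0, "result": {"hypotheses": [{"transcript": "i i i'm old enough i'm over"}], "final": false}}
{"status": 0, "result": {"hypotheses": [{"transcript": "i i i'm old enough i'm over fifty"}], "final": true}}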

alumae commented 7 years ago

Let me clarify my question: how do you get the result from kaldi-gstreamer-server? Using client.py?

mzarazov commented 7 years ago

Great pointer. While I'm using the final result, I'm pretty sure something in my modified gst-kaldi-nnet2-online code is messing it up and eating part of the JSON. Weirdly, it displays the full thing in the transcribe.sh script output. This must be my doing 🥇 I'll close this issue for now, since it's not a problem in kaldi-gstreamer-server. Thank you for your help!

P.S. The problem was as follows -- I was trying to extract word timing information, and a bug crept in -- since there is an <eps> label at the beginning, the last word was trimmed from the word hypotheses. Silly stuff.