alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License

Custom language model not working #186

Closed: alx741 closed 5 years ago

alx741 commented 5 years ago

The problem

I've trained a custom GMM/HMM model on my own Spanish data and get rather decent results, down to ~12% WER. When I try to use it with kaldi-gstreamer-server, though, I get terrible results (almost no output at all).

As test input I'm using a speech sample (sample.wav) that the model decodes perfectly (in the decoding phase of training) as "todos tienen el derecho a la educacion" with:

steps/decode.sh  --cmd "$decode_cmd" exp/tri2b/graph data/test exp/tri2b_mmi_b0.05/decode_test

However, when decoding on the worker with the python client:

python2 kaldigstserver/client.py -u ws://localhost:8080/client/ws/speech -r 88200 ./sample.wav

When I use the model in kaldi-gstreamer-server and send the exact same wave sample, the output I get is: "<UNK> un < u". So something is clearly going very wrong.

Note that I'm using -r 88200 because sample.wav is 44.1 kHz, 16-bit (is this reasoning correct?), though I've tried changing this value and the result is the same.
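My reasoning is 44100 samples/s × 2 bytes/sample × 1 channel = 88200 bytes/s, assuming -r takes a byte rate rather than a sample rate. A quick sketch I use to read those values from the WAV header with Python's standard wave module:

import wave

# Inspect the WAV header and compute the byte rate for client.py's -r flag
# (assuming -r means bytes of raw audio per second).
w = wave.open("sample.wav", "rb")
print("sample rate:  %d Hz" % w.getframerate())
print("sample width: %d bytes" % w.getsampwidth())
print("channels:     %d" % w.getnchannels())
print("-r value:     %d" % (w.getframerate() * w.getsampwidth() * w.getnchannels()))
w.close()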

What I'm doing

Note that before trying my custom model, I first tested kaldi-gstreamer-server with the provided English test model and data, and it worked (and continues to work) perfectly. It only fails when using my model.

This is how I'm setting this up:

I first run the training and then copy the resulting files to a model directory to be used by kaldi-gstreamer-server:

All the files are available here: https://drive.google.com/open?id=1aVxiWBl-hGN3JJCuJWa47aYJuloJD06u

cp exp/tri2b_mmi_b0.05/final.mdl   /media/kaldi_models/spanish
cp exp/tri2b/final.mat             /media/kaldi_models/spanish
cp exp/tri2b/graph/words.txt       /media/kaldi_models/spanish
cp exp/tri2b/graph/HCLG.fst        /media/kaldi_models/spanish

Then I write a config file: (NOTE: I'm actually using docker-kaldi-gstreamer-server here)

timeout-decoder : 10
decoder:
   model:     /opt/models/spanish/final.mdl
   lda-mat:   /opt/models/spanish/final.mat
   word-syms: /opt/models/spanish/words.txt
   fst:       /opt/models/spanish/HCLG.fst
   silence-phones: "1:2:3:4:5"
   beam: 13.0
out-dir: tmp

use-vad: False
silence-timeout: 60

# Just a sample post-processor that appends "." to the hypothesis
# post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'
logging:
    version : 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: DEBUG
    root:
        level: DEBUG
        handlers: [console]
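To rule out a broken config, I sanity-check that the YAML parses and that every path in the decoder section actually exists. A rough sketch using PyYAML; run it inside the container, where the models directory is mounted at /opt/models:

import os
import yaml

# Parse the worker config and verify each decoder file is present.
with open("/opt/models/spanish_worker.yaml") as f:
    conf = yaml.safe_load(f)

for key in ("model", "lda-mat", "word-syms", "fst"):
    path = conf["decoder"][key]
    status = "OK" if os.path.isfile(path) else "MISSING"
    print("%-10s %-35s %s" % (key, path, status))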

Run the container:

docker run -it -p 8080:80 -v /media/kaldi_models:/opt/models jcsilva/docker-kaldi-gstreamer-server:latest /bin/bash

Start a worker within the container:

 /opt/start.sh -y /opt/models/spanish_worker.yaml

Use the python client to get a result:

python2 kaldigstserver/client.py -u ws://localhost:8080/client/ws/speech -r 88200 ./sample.wav
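For reference, what the client does under the hood is stream raw audio chunks over the websocket, send the string "EOS" at the end, and read JSON hypotheses back. A stripped-down sketch of that exchange using the websocket-client package (the chunk size and pacing are arbitrary choices of mine):

import json
import time
import websocket  # pip install websocket-client

# Stream the file in small chunks, roughly paced like live audio,
# then signal end-of-stream and print the final hypothesis.
ws = websocket.create_connection("ws://localhost:8080/client/ws/speech")
with open("sample.wav", "rb") as f:
    while True:
        chunk = f.read(8000)
        if not chunk:
            break
        ws.send_binary(chunk)
        time.sleep(0.25)
ws.send("EOS")
while True:
    msg = json.loads(ws.recv())
    if msg.get("result", {}).get("final"):
        print(msg["result"]["hypotheses"][0]["transcript"])
        break
ws.close()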

Is there some particular requirement on how a model must be trained for it to work with kaldi-gstreamer-server? If not, what could be the problem here?

alx741 commented 5 years ago

SOLVED

I was training the model on audio data with a 44.1 kHz sampling rate, which apparently makes the decoder fail.

I fixed it by converting all the audio data to a 16 kHz sampling rate and re-training the model. I then used this new model with the exact same setup described in the first comment above, and it worked perfectly.

Note that the audio data I send to the decoder is also converted to 16 kHz first.
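In case it's useful to anyone, this is the kind of conversion I do. A minimal sketch with Python's standard wave and audioop modules; sox or ffmpeg works just as well, and note that audioop was removed from the stdlib in Python 3.13:

import audioop
import wave

# Downsample a 16-bit PCM WAV file from its native rate to 16 kHz.
src = wave.open("sample.wav", "rb")
frames = src.readframes(src.getnframes())
converted, _ = audioop.ratecv(
    frames, src.getsampwidth(), src.getnchannels(),
    src.getframerate(), 16000, None)

dst = wave.open("sample_16k.wav", "wb")
dst.setnchannels(src.getnchannels())
dst.setsampwidth(src.getsampwidth())
dst.setframerate(16000)
dst.writeframes(converted)
dst.close()
src.close()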

A follow-up issue about the decoder failing in this scenario is #187.