alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License

Only works when model is trained using 16 kHz audio data #187

Open alx741 opened 5 years ago

alx741 commented 5 years ago

Apparently, when the model is trained on audio data with a sample rate other than 16 kHz, the decoder fails to decode audio at any sample rate, even after tweaking the corresponding sample rate parameters in the request to the server (or in the client arguments, for that matter).
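For readers unsure what "sample rate parameters on the request" means here: for raw audio, the server is told the format through a GStreamer caps-style content type. Below is a minimal sketch of building such a string, assuming the caps layout shown in the project's README; the helper name `raw_audio_content_type` is hypothetical and not part of the server or client code.

```python
def raw_audio_content_type(rate=16000, sample_format="S16LE", channels=1):
    """Build a GStreamer caps-style content type describing raw audio.

    Hypothetical helper: the field layout is assumed from the README
    example for raw audio, not taken from the server's source code.
    """
    return (
        "audio/x-raw, layout=(string)interleaved, "
        f"rate=(int){rate}, format=(string){sample_format}, "
        f"channels=(int){channels}"
    )


# Describe 44.1 kHz, 16-bit little-endian, mono audio.
print(raw_audio_content_type(rate=44100))
```

As reported above, changing this rate alone did not help when the model itself had been trained on non-16 kHz audio.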

This was the issue I was having in #186: my model was originally trained with 44.1 kHz audio data (with a matching MFCC config, --sample-frequency=44100, of course). When I converted all my data to 16 kHz and retrained the model, it worked perfectly.
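Before retraining or decoding, it can help to confirm what sample rate the audio files actually have. The following is a minimal sketch using only Python's standard `wave` module; the function names and the 16000 Hz default are assumptions for illustration, and it only reads the WAV header (it does not resample anything).

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a WAV file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def matches_model_rate(path, model_rate=16000):
    """Return True if the file's sample rate equals the rate the model expects.

    model_rate is an assumed default; set it to whatever
    --sample-frequency your MFCC config uses.
    """
    return wav_sample_rate(path) == model_rate
```

For example, `matches_model_rate("utt1.wav")` would return False for a 44.1 kHz recording, flagging it for conversion before it is used with a 16 kHz model (`utt1.wav` is a placeholder file name).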

NOTE: This problem is likely to be in Kaldi's decoder rather than in kaldi-gstreamer-server, but this is where I first encountered it, so I'm putting it here to promote further investigation.

svenha commented 5 years ago

Just curious: how does the performance (WER) differ between 44.1 kHz and 16 kHz?

alx741 commented 5 years ago

@svenha It actually improved: WER dropped from ~12% (44.1 kHz) to ~8% (16 kHz).

svenha commented 5 years ago

So, 16 kHz is better? This would fit with other reports.

alx741 commented 5 years ago

> So, 16 kHz is better? This would fit with other reports.

Yes, 16 kHz seems to be better.