daanzu / kaldi-active-grammar

Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time
GNU Affero General Public License v3.0

VAD in example doesn't seem to work #70

Open p-e-w opened 2 years ago

p-e-w commented 2 years ago

When running full_example.py, the speech recognition itself works fine, but the VAD iterator completely fails to detect voice activity, distinguishing only between "sound" and "silence".

My understanding is that audio_iterator should yield a block of audio data if the input contains voice, and None otherwise. If so, this doesn't work on my system. As long as the microphone records any sound at all, the iterator yields audio blocks. I have tested this by snapping my fingers, scratching the desk, and even with the background noise of a ceiling fan running – all of them cause the iterator to produce blocks. Only virtually total silence produces None.
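To make the expected contract concrete, here is a minimal sketch of an iterator that yields a block when a detector says "voice" and None otherwise, plus a consumer that ends a phrase after a run of None values. The names `vad_blocks`, `split_phrases`, `is_speech` and `padding_blocks` are hypothetical and are not taken from full_example.py or the library; the real example may be structured differently.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def vad_blocks(
    blocks: Iterable[bytes],
    is_speech: Callable[[bytes], bool],  # hypothetical detector, supplied by the caller
) -> Iterator[Optional[bytes]]:
    """Yield each block if the detector classifies it as voice, else None."""
    for block in blocks:
        yield block if is_speech(block) else None


def split_phrases(
    stream: Iterable[Optional[bytes]],
    padding_blocks: int = 3,  # hypothetical: consecutive None values that end a phrase
) -> Iterator[List[bytes]]:
    """Group voiced blocks into phrases, closing one after enough non-voice blocks."""
    phrase: List[bytes] = []
    silent = 0
    for item in stream:
        if item is not None:
            phrase.append(item)
            silent = 0
        elif phrase:
            silent += 1
            if silent >= padding_blocks:
                yield phrase
                phrase = []
                silent = 0
    if phrase:
        # Flush whatever is left when the input ends.
        yield phrase
```

With a consumer like this, a detector that returns True for every non-silent block never produces the run of None values needed to close a phrase, which matches the symptom described above.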

As a result, the end of phrase isn't detected unless the room is very, very quiet. I have done multiple test recordings from the same microphone setup and found them to be clear and without additional noise. Yet as soon as there is any input above a certain threshold, even if it is obviously non-human in origin, it is classified as voice. A modern VAD should be able to do much better.
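The behavior described here is what a pure energy gate would produce. The sketch below is only an illustration of that failure mode, not a claim about how the example's VAD is implemented: `rms`, `energy_gate`, `tone` and the threshold of 500 are all hypothetical, and the input is assumed to be 16-bit little-endian mono PCM.

```python
import math
import struct


def rms(block: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM."""
    count = len(block) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, block[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def energy_gate(block: bytes, threshold: float = 500.0) -> bool:
    """Classify a block as 'voice' purely by loudness (hypothetical threshold)."""
    return rms(block) >= threshold


def tone(freq_hz: float, amplitude: float, n: int = 480, rate: int = 16000) -> bytes:
    """Generate n samples of a sine wave as a synthetic, clearly non-human test sound."""
    samples = (
        int(amplitude * math.sin(2 * math.pi * freq_hz * i / rate)) for i in range(n)
    )
    return struct.pack("<%dh" % n, *samples)


silence = bytes(960)            # 480 zero-valued samples
fan_hum = tone(100, 3000)       # loud low-frequency hum
faint_hum = tone(100, 100)      # same hum, much quieter

print(energy_gate(silence), energy_gate(fan_hum), energy_gate(faint_hum))
```

A gate like this passes any sufficiently loud sound regardless of its origin, so comparing its decisions with the iterator's output on the same recording is one way to check whether the iterator is doing more than thresholding.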

Is this actually working for you? What could be the reason for the VAD to fail so completely?