dpirch / libfvad

Voice activity detection (VAD) library, based on WebRTC's VAD engine
BSD 3-Clause "New" or "Revised" License

libfvad classifies any noise as a human voice. #23

Open ababo opened 4 years ago

ababo commented 4 years ago

Any noise or intense sound is classified as a human voice.

josharian commented 4 years ago

That's my experience, too.

pahlevan commented 4 years ago

me too

alamnasim commented 3 years ago

For me it classifies music as human voice. Can anyone confirm whether it is trained to detect music as human voice or not?

jonnor commented 2 months ago

I have done some looking into this, and in my opinion these problems are due to the nature of the WebRTC Voice Activity Detection algorithm. It does online estimation that attempts to separate the "background" (slowly changing) from the "foreground" (rapidly changing). This is done using a Gaussian Mixture Model over 6 frequency sub-bands, with coefficients set to prefer speech bands. Conceptually, it is an energy-based VAD with an adaptive threshold.
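To make the "slow background vs. fast foreground" idea concrete, here is a minimal toy sketch in C. This is not the WebRTC algorithm (which models six sub-band energies with a GMM); all constants and names here are invented for illustration. It only compares frame energy against a slowly adapting noise-floor estimate, which is exactly why any loud novel sound, speech or not, would trigger it:

```c
#include <math.h>
#include <stddef.h>

/* Toy energy-based VAD with an adaptive background estimate.
 * Illustrative only; all thresholds are made up. */
typedef struct {
    double background;   /* slowly adapting noise-floor estimate */
    double alpha;        /* smoothing factor for the background */
    double ratio;        /* energy ratio that triggers "speech" */
} ToyVad;

static void toy_vad_init(ToyVad *v)
{
    v->background = 1e-4;  /* small non-zero floor to avoid divide-by-zero logic */
    v->alpha = 0.05;       /* slow adaptation */
    v->ratio = 4.0;        /* foreground must exceed 4x the background */
}

/* Returns 1 if the frame looks like foreground ("speech"), else 0. */
static int toy_vad_process(ToyVad *v, const double *frame, size_t n)
{
    double energy = 0.0;
    for (size_t i = 0; i < n; i++)
        energy += frame[i] * frame[i];
    energy /= (double)n;

    int active = energy > v->ratio * v->background;

    /* Only adapt the background on quiet frames, so a long loud
     * burst does not immediately become the new "background". */
    if (!active)
        v->background = (1.0 - v->alpha) * v->background + v->alpha * energy;
    return active;
}
```

Note that nothing in this sketch looks at spectral shape, pitch, or any other speech-specific cue, so a door slam and a spoken word are indistinguishable to it; the real algorithm's speech-weighted sub-bands help somewhat, but the same limitation applies.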

So in practice, it acts more like a novelty detector: any short change in the acoustic signal is considered a likely candidate for "speech". This means that it is good for:

- detecting the onset of any sound against a quiet or stationary background
- running cheaply in real-time or on embedded systems

And that it is not good for:

- distinguishing speech from music (as reported above)
- rejecting noise bursts or other non-stationary, non-speech sounds

So if those things are needed, one would need a more advanced algorithm: for example, a model trained on large datasets to separate speech from other sounds. This could possibly be run as a second stage after this VAD. Alternatively, the filterbank in WebRTC VAD (which is very computationally efficient) could be used as the feature extractor for such a supervised model. I am considering doing the latter as an example/demo for the https://github.com/emlearn/emlearn project.