dpirch / libfvad

Voice activity detection (VAD) library, based on WebRTC's VAD engine
BSD 3-Clause "New" or "Revised" License

libfvad classifies any noise as a human voice. #23

Open ababo opened 4 years ago

ababo commented 4 years ago

Any noise or intense sound is classified as a human voice.

josharian commented 4 years ago

That's my experience, too.

pahlevan commented 4 years ago

me too

alamnasim commented 3 years ago

For me it classifies music as human voice. Can anyone confirm whether it is trained to detect music as human voice or not?

jonnor commented 2 months ago

I have done some looking into this, and in my opinion these problems are due to the nature of the WebRTC Voice Activity Detection algorithm. It does online estimation that attempts to separate the "background" (slowly changing) from the "foreground" (rapidly changing). This is done using a Gaussian Mixture Model over 6 frequency sub-bands, with coefficients set to prefer speech bands. Conceptually, it is an energy-based VAD with an adaptive threshold.
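To make the "slow background vs. fast foreground" idea concrete, here is a minimal toy sketch in C. This is not the WebRTC algorithm (which models six sub-band energies with a GMM); all constants and names here are invented for illustration. It only compares frame energy against a slowly adapting noise-floor estimate, which is exactly why any loud novel sound, speech or not, would trigger it:

```c
#include <math.h>
#include <stddef.h>

/* Toy energy-based VAD with an adaptive background estimate.
 * Illustrative only; all thresholds are made up. */
typedef struct {
    double background;   /* slowly adapting noise-floor estimate */
    double alpha;        /* smoothing factor for the background */
    double ratio;        /* energy ratio that triggers "speech" */
} ToyVad;

static void toy_vad_init(ToyVad *v)
{
    v->background = 1e-4;  /* small non-zero floor to avoid divide-by-zero logic */
    v->alpha = 0.05;       /* slow adaptation */
    v->ratio = 4.0;        /* foreground must exceed 4x the background */
}

/* Returns 1 if the frame looks like foreground ("speech"), else 0. */
static int toy_vad_process(ToyVad *v, const double *frame, size_t n)
{
    double energy = 0.0;
    for (size_t i = 0; i < n; i++)
        energy += frame[i] * frame[i];
    energy /= (double)n;

    int active = energy > v->ratio * v->background;

    /* Only adapt the background on quiet frames, so a long loud
     * burst does not immediately become the new "background". */
    if (!active)
        v->background = (1.0 - v->alpha) * v->background + v->alpha * energy;
    return active;
}
```

Note that nothing in this sketch looks at spectral shape, pitch, or any other speech-specific cue, so a door slam and a spoken word are indistinguishable to it; the real algorithm's speech-weighted sub-bands help somewhat, but the same limitation applies.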

So in practice, it acts more like a novelty detector: any short change in the acoustic signal is considered a likely candidate for "speech". This means that it is good for:

- detecting the onset of any sound against a quiet or stationary background
- running cheaply in real-time or on embedded systems

And that it is not good for:

- distinguishing speech from music (as reported above)
- rejecting noise bursts or other non-stationary, non-speech sounds

So if those things are needed, one would need a more advanced algorithm: for example, a model trained on large datasets to separate speech from other sounds. This could possibly be run as a second stage after this VAD. Alternatively, the filterbank in WebRTC VAD (which is very computationally efficient) could be used as the feature extractor for such a supervised model. I am considering doing the latter as an example/demo for the https://github.com/emlearn/emlearn project.