Instruments
We have compared 3 easy-to-use off-the-shelf instruments for voice activity / audio activity detection:
- webrtcvad
- auditok
- silero-vad
Caveats
- Full disclaimer: we are mostly interested in voice detection, not just silence detection;
- In our extensive experiments we noticed that WebRTC is actually much better at detecting silence than at detecting speech (probably by design). It produces a lot of false positives when detecting speech;
- auditok provides Audio Activity Detection, which in layman's terms probably just means detecting silence;
- silero-vad is geared towards speech detection (as opposed to noise or music);
- A sensible chunk size for our VAD is at least 75-100 ms (pauses in speech shorter than 100 ms are not very meaningful), though we prefer 150-250 ms chunks (see quality comparison here), while auditok and webrtcvad use 30-50 ms chunks (we used the default values of 30 ms for webrtcvad and 50 ms for auditok);
- We have excluded pyannote-audio for now (https://github.com/pyannote/pyannote-audio), since its pre-trained models cover only limited academic datasets and it is mostly a recipe collection / toolkit for building your own tools, not a finished tool per se (also, for such a simple task the amount of code bloat is puzzling from a production standpoint; our internal VAD training code is literally just 5 Python modules).
Methodology
Please refer here - https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology
Quality Benchmarks
Finished tests:
Portability and Speed
- It looks like webrtcvad was originally written in C++ around 2016, so theoretically it can be ported to many platforms;
- I have inquired in the community: the original VAD seems to have matured, and the Python version is based on the 2018 version;
- It looks like auditok is written in plain Python, but I guess the algorithm itself can be ported;
- silero-vad is based on PyTorch and ONNX, so it offers the same portability options as both of these frameworks (mobile, different backends for ONNX, Java and C++ inference APIs, graph conversion from ONNX).
This is by no means extensive or complete research on the topic, please point out if anything is lacking.
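To make the chunk sizes mentioned in the caveats more concrete, here is a minimal sketch that converts a chunk duration into a sample count and splits a signal into fixed-length chunks. The 16 kHz sample rate, the helper names and the synthetic signal are assumptions made for illustration only; none of this is taken from the code of the compared tools.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000  # assumed sample rate in Hz (illustrative only)


def samples_per_chunk(chunk_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples in one chunk lasting `chunk_ms` milliseconds."""
    return sample_rate * chunk_ms // 1000


def split_into_chunks(signal: Sequence[float], chunk_ms: int,
                      sample_rate: int = SAMPLE_RATE) -> List[Sequence[float]]:
    """Split a signal into consecutive full-length chunks.

    A trailing partial chunk is dropped (a hypothetical policy chosen here
    for simplicity; a real pipeline might pad it instead).
    """
    size = samples_per_chunk(chunk_ms, sample_rate)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


# One second of digital silence as a stand-in for real audio.
one_second = [0.0] * SAMPLE_RATE

# Chunk durations discussed in the caveats: 30 ms and 50 ms defaults
# versus the longer 100-250 ms chunks.
for chunk_ms in (30, 50, 100, 250):
    chunks = split_into_chunks(one_second, chunk_ms)
    print(f"{chunk_ms:>3} ms: {samples_per_chunk(chunk_ms)} samples per chunk, "
          f"{len(chunks)} full chunks per second")
```

At the assumed 16 kHz, a 30 ms chunk holds 480 samples and a 250 ms chunk holds 4000, so each decision is made over a very different amount of context depending on the chunk size.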