amsehili / auditok

An audio/acoustic activity detection and audio segmentation tool
MIT License

Quality Benchmarks Between auditok / webrtcvad / silero-vad #32

Open snakers4 opened 3 years ago

snakers4 commented 3 years ago

Here I will post our benchmarks comparing these three instruments.

snakers4 commented 3 years ago

Instruments

We have compared 3 easy-to-use off-the-shelf instruments for voice activity / audio activity detection: auditok, webrtcvad and silero-vad.

Caveats

Methodology

Please refer here - https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology

Quality Benchmarks

Finished tests:

image

Portability and Speed

This is by no means extensive or complete research on the topic, so please point out anything that is lacking.

amsehili commented 3 years ago

Nice, thanks for sharing! I expected webrtc to perform much better than auditok given that it uses GMM models trained on large speech data. auditok's detection algorithm is as simple as a threshold comparison; the energy computation algorithm itself comes from the standard library (audioop module).
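For illustration, a threshold comparison on window energy can be sketched as below. This is a minimal sketch and not auditok's actual code: the function names, the 16-bit little-endian mono PCM assumption, the dB formula and the threshold value of 50 are all assumptions made for the example. The energy is computed by hand rather than through audioop, so the sketch also runs on Python versions where that module is no longer in the standard library.

```python
import math
import struct


def log_energy(window: bytes) -> float:
    """Log energy in dB of a window of 16-bit little-endian mono PCM (assumed format)."""
    n = len(window) // 2
    if n == 0:
        return -200.0
    samples = struct.unpack("<%dh" % n, window[: 2 * n])
    mean_square = sum(s * s for s in samples) / n
    # Floor the mean square so that pure silence does not hit log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-20))


def is_active(window: bytes, energy_threshold: float = 50.0) -> bool:
    """Hypothetical detector: a window is 'active' if its log energy reaches the threshold."""
    return log_energy(window) >= energy_threshold


if __name__ == "__main__":
    silence = bytes(320)                            # 160 zero-valued samples
    loud = struct.pack("<160h", *([10000] * 160))   # constant high-amplitude samples
    print(is_active(silence), is_active(loud))
```

A detector like this has no notion of speech as such: any sufficiently loud window passes, which is one plausible reason for it to score lower than model-based detectors on noisy recordings.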

Its main strengths are a flexible and intuitive API for working with time (duration of speech and silence) and the ability to run online. The default detection algorithm can easily be replaced by a user-provided algorithm (see the validator argument in the split function), so in principle it can use webrtc or silero-vad as a backend detection algorithm.
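A rough sketch of how a different detector could be plugged in through the validator argument follows. It assumes that split accepts a plain callable that receives one analysis window of raw audio bytes and returns True or False; make_validator, peak_score and split_with_backend are hypothetical names introduced for this example, and peak_score is only a toy stand-in for a real backend such as webrtc or silero-vad.

```python
import struct
from typing import Callable


def make_validator(score_fn: Callable[[bytes], float],
                   threshold: float) -> Callable[[bytes], bool]:
    """Wrap any per-window scoring function into a True/False validator."""
    def validator(window: bytes) -> bool:
        return score_fn(window) >= threshold
    return validator


def peak_score(window: bytes) -> float:
    """Toy stand-in for a real model: peak amplitude of 16-bit PCM scaled to [0, 1]."""
    n = len(window) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack("<%dh" % n, window[: 2 * n])
    return max(abs(s) for s in samples) / 32768.0


def split_with_backend(path: str, score_fn: Callable[[bytes], float],
                       threshold: float = 0.5):
    """Run auditok's split with a custom validator (requires auditok to be installed)."""
    import auditok  # third-party; imported here so the helpers above work without it
    return auditok.split(path, validator=make_validator(score_fn, threshold))
```

With this shape, swapping the backend means swapping score_fn, while the duration handling (minimum and maximum event length, tolerated silence) stays with split.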

snakers4 commented 3 years ago

Maybe it is just non-optimal standard params, or maybe it is our validation set, which is just calls annotated by STT and then hand-checked.

The only real way to find out is to share the results and see how other people measure their VADs.

As for using silero-vad as an engine: we deliberately kept it simple and even omitted module packaging, because if you look past the data-loading bits, it is literally loaded with one command, torch.hub.load, and then it just accepts audio as is.

I am not sure yet how to package it better.
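As an illustration of the torch.hub.load usage mentioned above, a minimal sketch might look like the following. It assumes the hub entry point name silero_vad and the order of the returned helper tuple as given in the silero-vad README at the time of writing; both may differ between releases, so treat them as assumptions. detect_speech and to_seconds are hypothetical helper names introduced here.

```python
def to_seconds(timestamps, sampling_rate=16000):
    """Convert [{'start': n, 'end': m}, ...] given in samples to (start_s, end_s) tuples."""
    return [(t["start"] / sampling_rate, t["end"] / sampling_rate) for t in timestamps]


def detect_speech(path, sampling_rate=16000):
    """Load silero-vad through torch.hub and return speech segments in seconds.

    Requires torch and network access on first use. The entry point name and
    the positions of the helpers in `utils` are assumptions.
    """
    import torch  # third-party; imported here so to_seconds works without it

    model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
    get_speech_timestamps, read_audio = utils[0], utils[2]  # assumed tuple order
    wav = read_audio(path, sampling_rate=sampling_rate)
    timestamps = get_speech_timestamps(wav, model, sampling_rate=sampling_rate)
    return to_seconds(timestamps, sampling_rate)
```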