dpirch / libfvad

Voice activity detection (VAD) library, based on WebRTC's VAD engine
BSD 3-Clause "New" or "Revised" License

Does not take into account bit depth and channel numbers #7

Closed SephVelut closed 6 years ago

SephVelut commented 6 years ago

To measure how many bytes are in x milliseconds of an audio stream, you must take into account the bit depth and the number of channels. For example, to find n bytes in x milliseconds, this is what I do:

long bytes_per_second(int sample_rate, int8_t bit_depth, int8_t channels) {
    auto byte_depth = bit_depth / 8;

    return sample_rate * channels * byte_depth;
}

auto bps = bytes_per_second(sample_rate, bit_depth, channels);
auto bytes_per_millisecond = bps / 1000;
auto bytes_per_chunk = bytes_per_millisecond * chunk_ms; // chunk_ms: 10, 20 or 30

For 10 milliseconds of bytes at an 8000 Hz sample rate this could be 80 bytes, or 160 bytes for 16-bit samples, or 160 bytes for 8-bit samples with 2 channels, etc. Currently fvad_process only accepts 80 bytes for 8000 Hz and 10 milliseconds. Does the WebRTC VAD impose these limitations?

Also, would it hurt the WebRTC VAD's accuracy if I gave it bytes in the in-between range of 10, 20 and 30 milliseconds? Like 11 or 24 milliseconds' worth of bytes? This is a problem for me because I end up with leftover bytes that don't fit neatly into 10, 20 or 30.

So I was hoping I could redistribute the byte sets to include one more byte per chunk, to use up the remainder bytes. Hope this makes sense: given 18 bytes with a chunk_size of 4, that is 4 bytes per chunk with a remainder of 2 bytes; a solution could be to use 5-byte chunks for the first 10 bytes and then 4-byte chunks for the remaining 8 bytes.

dpirch commented 6 years ago

The VAD engine internally works only with these fixed chunk sizes and only with 16-bit single-channel audio.

If you have samples in a different range, just multiply them by a factor so they end up in the 16-bit range, i.e. -32768 to 32767. If you have multi-channel audio, you could just average the samples from all channels into one.
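A minimal sketch of that downmixing and scaling, assuming interleaved stereo float samples in the range -1.0 to 1.0 (the input format and function name are assumptions, not part of libfvad):

```c
#include <stdint.h>
#include <stddef.h>

/* Average interleaved stereo samples into mono and scale from the
 * assumed [-1.0, 1.0] float range into the 16-bit range the VAD
 * expects. Adapt the scale factor to your actual input range. */
void stereo_float_to_mono_s16(const float *in, int16_t *out, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        float avg = (in[2 * i] + in[2 * i + 1]) / 2.0f; /* downmix */
        float scaled = avg * 32767.0f;                  /* to 16-bit range */
        if (scaled > 32767.0f) scaled = 32767.0f;       /* clamp */
        if (scaled < -32768.0f) scaled = -32768.0f;
        out[i] = (int16_t)scaled;
    }
}
```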

If your audio data isn't a multiple of 10/20/30 ms, either just drop the remainder at the end (and perhaps assume the detection result is the same as for the last complete chunk before it), or fill up the last chunk with zeros.
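The zero-padding option could look like this, assuming 16-bit mono samples and a 10 ms chunk at 8000 Hz (80 samples); the helper name is hypothetical:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define CHUNK_SAMPLES 80  /* 10 ms at 8000 Hz, 16-bit mono */

/* Copy the trailing partial chunk into a full-size buffer and fill
 * the rest with zeros, so it can still be fed to the VAD as one
 * complete chunk. tail_len must be <= CHUNK_SAMPLES. */
void pad_last_chunk(const int16_t *tail, size_t tail_len, int16_t *chunk)
{
    memcpy(chunk, tail, tail_len * sizeof(int16_t));
    memset(chunk + tail_len, 0,
           (CHUNK_SAMPLES - tail_len) * sizeof(int16_t));
}
```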

(maybe I'll add this to the README...)

> Currently fvad_process only accepts 80 bytes

80 is actually the number of 16-bit samples, not bytes, for a 10ms chunk at 8000Hz sample rate.
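To make the units concrete, a small helper (hypothetical, not part of libfvad's API) computing the frame length in samples:

```c
#include <stddef.h>

/* The VAD frame length is counted in 16-bit samples, not bytes:
 * samples = sample_rate / 1000 * chunk_ms. */
size_t vad_frame_length(int sample_rate, int chunk_ms)
{
    return (size_t)sample_rate / 1000 * chunk_ms;
}

/* e.g. 8000 Hz, 10 ms -> 80 samples (which is 160 bytes of int16_t
 * data); that sample count is what you pass to fvad_process(). */
```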