Closed SephVelut closed 6 years ago
The VAD engine internally works only with these fixed chunk sizes and only with 16-bit single-channel audio.
If your samples are in a different range, just multiply them by a factor so they end up in the 16-bit range, i.e. -32768 to 32767. If you have multi-channel audio, you could just average the samples from all channels into one.
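A minimal sketch of both conversions described above. These helpers are hypothetical (not part of libfvad): one scales float samples in [-1.0, 1.0] into the 16-bit range, the other averages interleaved stereo down to mono.

```python
def float_to_int16(samples):
    """Scale float samples in [-1.0, 1.0] to the int16 range, clamping
    anything that would overflow."""
    out = []
    for s in samples:
        v = int(s * 32767)
        out.append(max(-32768, min(32767, v)))
    return out

def stereo_to_mono(interleaved):
    """Average each interleaved L/R pair into a single mono sample."""
    return [(interleaved[i] + interleaved[i + 1]) // 2
            for i in range(0, len(interleaved), 2)]
```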
If your audio data's length isn't a multiple of 10/20/30 ms, either just drop the leftover samples at the end (and maybe assume the detection result is the same as for the last complete chunk before it), or fill up the last chunk with zeros.
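Both options can be sketched in a few lines. This is a hypothetical helper (not a libfvad function) that splits a sample buffer into fixed-size frames and either zero-pads or drops the partial frame at the end:

```python
def split_into_frames(samples, frame_len, pad=True):
    """Split samples into frames of exactly frame_len samples.
    The final partial frame is zero-padded if pad=True, dropped otherwise."""
    frames = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if len(frame) < frame_len:
            if pad:
                frame = frame + [0] * (frame_len - len(frame))
            else:
                break  # drop the leftover tail
        frames.append(frame)
    return frames
```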
(maybe I'll add this to the README...)
Currently fvad_process only accepts 80 bytes
80 is actually the number of 16-bit samples, not bytes, for a 10ms chunk at 8000Hz sample rate.
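The frame length is therefore just sample rate times frame duration, counted in samples rather than bytes. A small sketch of that arithmetic (the function name is my own, not a libfvad API):

```python
def frame_length(sample_rate_hz, frame_ms):
    """Number of 16-bit samples that make up one VAD frame:
    sample_rate * duration. Bit depth and channel count don't enter,
    since the VAD works on 16-bit mono samples only."""
    return sample_rate_hz * frame_ms // 1000
```

So at 8000 Hz, the 10/20/30 ms frames are 80, 160, and 240 samples (twice that many bytes if you store them as raw 16-bit PCM).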
To work out how many bytes are in x milliseconds of audio you have to take into account the bit depth and number of channels. For example, here is how I find the number of bytes in x milliseconds:
For 10 milliseconds at an 8000 Hz sample rate this could be 80 bytes, or 160 bytes for 16-bit samples, or 160 bytes for 8-bit samples but 2 channels, etc. Currently fvad_process only accepts 80 for 8000 Hz and 10 milliseconds. Does the WebRTC VAD impose these limitations?
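The byte-count calculation being described can be written out as a one-liner (hypothetical helper, not part of any library; note that, per the answer above, libfvad itself counts samples, not bytes):

```python
def bytes_for_duration(sample_rate_hz, ms, bits_per_sample, channels):
    """Raw PCM byte count for `ms` milliseconds of audio:
    samples * bytes-per-sample * channels."""
    samples = sample_rate_hz * ms // 1000
    return samples * (bits_per_sample // 8) * channels
```

For 10 ms at 8000 Hz: 160 bytes for 16-bit mono, 80 bytes for 8-bit mono, and 160 bytes for 8-bit stereo.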
Also, would it hurt the WebRTC VAD's accuracy if I gave it bytes in between the 10 - 20 - 30 millisecond sizes, like 11 or 24 milliseconds' worth? This is a problem for me because I end up with leftover bytes that don't fit neatly into 10/20/30.
So I was hoping I could redistribute the byte sets to include one more byte per chunk, which would use up the remainder bytes. Hope this makes sense:
given 18 bytes / 4 chunk_size = 4 bytes per chunk with a remainder of 2 bytes
a solution could be to use 5 bytes per chunk for the first 10 bytes, and then 4-byte chunks for the remaining 8 bytes
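The redistribution arithmetic in the example can be sketched as below (a hypothetical helper; note that, as the answer at the top explains, the VAD itself only accepts the fixed 10/20/30 ms frame sizes, so this scheme can't actually be fed to fvad_process):

```python
def redistribute(total, chunk_size):
    """Split `total` items into chunks of chunk_size or chunk_size + 1,
    giving the first chunks one extra item each to absorb the remainder.
    Assumes remainder <= number of chunks, as in the 18/4 example."""
    n_chunks = total // chunk_size
    remainder = total % chunk_size
    return [chunk_size + 1] * remainder + [chunk_size] * (n_chunks - remainder)
```

For the example above, 18 bytes with a chunk size of 4 comes out as chunks of 5, 5, 4, 4.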