RicherMans / GPV

Repository for our Interspeech2020 general-purpose voice activity detection (GPVAD) paper
https://arxiv.org/abs/2003.12222
GNU General Public License v3.0

Improving Performance on Shorter Audio Clips #5

Open shawnbzhang opened 4 years ago

shawnbzhang commented 4 years ago

Using your GPVAD/VADC, I would like to process smaller chunks (~200 ms) of audio files. However, at such short durations the VAD performs poorly. What can I do to improve performance? I assume this must be done on the training side. Would you recommend downloading the datasets, splitting them into these smaller chunks, and retraining from scratch?

Curious to hear your thoughts. Thank you!

RicherMans commented 4 years ago

Hey there, so far the proposed GPV is not "online", meaning it does not directly output one probability per frame. Performance depends on utterance length, since the bidirectional GRU gains more context from longer inputs.

> What can I do to better the performance? I assume this must be done in the training side. Would you recommend downloading the datasets and splicing them into these smaller chunks, retraining from scratch?

Well, the point of the entire project is to show that VAD can be trained at clip level using weak (here: inexact and noisy) supervision. If you have labels for, e.g., every 200 ms, just train a standard frame-level VAD model. In reality, though, I doubt you have this type of supervision available; it's too costly to obtain.
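If you did have 200 ms segment labels, turning them into frame-level training targets is mechanical. A minimal sketch, assuming feature frames with a 10 ms hop (so 20 frames per 200 ms segment); the function name and constants are illustrative, not from the repo:

```python
import numpy as np

SEGMENT_MS = 200
HOP_MS = 10
FRAMES_PER_SEGMENT = SEGMENT_MS // HOP_MS  # 20 frames per labeled segment

def segment_to_frame_labels(segment_labels):
    """Repeat each per-segment speech/non-speech label across its frames,
    yielding one target per feature frame for standard VAD training."""
    return np.repeat(np.asarray(segment_labels), FRAMES_PER_SEGMENT)

# Three 200 ms segments: speech, non-speech, speech -> 60 frame targets.
frame_labels = segment_to_frame_labels([1, 0, 1])
```

With targets like these you could train any frame-wise classifier with a per-frame cross-entropy loss instead of the clip-level weak supervision used here.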

> However, when the duration is this low, the performance of the VAD is poor.

Well, how about splicing some short utterances together during testing?
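One way to do that splicing: concatenate the short chunks, run the model once over the longer signal (so the bidirectional GRU sees more context), then cut the frame-wise output back into per-chunk predictions. A sketch with a placeholder `predict_frames` standing in for the actual GPV model, assuming 16 kHz audio and a 160-sample (10 ms) hop:

```python
import numpy as np

HOP = 160  # samples per frame at 16 kHz, 10 ms hop (assumed, not from the repo)

def predict_frames(audio):
    """Placeholder for the model's per-frame speech probabilities."""
    return np.zeros(len(audio) // HOP)

def vad_on_spliced(chunks):
    """Run VAD once on the concatenation of short chunks, then split the
    frame-wise probabilities back into one array per original chunk."""
    spliced = np.concatenate(chunks)
    probs = predict_frames(spliced)
    frames_per_chunk = [len(c) // HOP for c in chunks]
    bounds = np.cumsum([0] + frames_per_chunk)
    return [probs[bounds[i]:bounds[i + 1]] for i in range(len(chunks))]

# Three ~200 ms chunks (3200 samples each at 16 kHz).
chunks = [np.zeros(3200, dtype=np.float32) for _ in range(3)]
per_chunk_probs = vad_on_spliced(chunks)
```

Note this only helps if the spliced chunks are acoustically compatible; the GRU will carry context across chunk boundaries, which can smear predictions near the seams.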