Open · shawnbzhang opened this issue 4 years ago
Hey there, so far the proposed GPV is not "online", meaning it does not directly output one probability per frame. Performance also depends on utterance length, since the bidirectional GRU gathers more information from longer inputs.
> What can I do to improve the performance? I assume this must be addressed on the training side. Would you recommend downloading the datasets, splicing them into these smaller chunks, and retraining from scratch?
Well, the point of the entire project is just to show that VAD can be trained at clip level using weak (here inexact and noisy) supervision. If you have labels for every, e.g., 200 ms segment, just train a standard frame-level VAD model. In reality, though, I doubt you have this type of supervision available; it's too costly.
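To make the "just train a standard VAD model" alternative concrete, here is a minimal sketch of strongly supervised training: one feature vector and one speech/non-speech label per 200 ms segment. This is not the repo's code; the synthetic data and the plain logistic-regression classifier are stand-ins chosen only to show the supervision shape that clip-level weak labels avoid needing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy strongly supervised setup: one feature vector and one binary
# speech/non-speech label per 200 ms segment (the dense labels that
# clip-level weak supervision is meant to avoid).
n_seg, n_feat = 500, 8
X = rng.normal(size=(n_seg, n_feat))
w_true = rng.normal(size=n_feat)
y = (X @ w_true + 0.1 * rng.normal(size=n_seg) > 0).astype(float)

# Plain logistic regression over segments -- a minimal "standard VAD"
# stand-in; a real system would use a neural frame classifier instead.
w = np.zeros(n_feat)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n_seg

acc = ((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == y.astype(bool)).mean()
print(f"train accuracy: {acc:.2f}")
```

The cost the comment refers to is exactly the `y` array here: someone has to annotate every 200 ms segment, which rarely scales to real datasets.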
> However, when the duration is this low, the performance of the VAD is poor.
Well, how about just splicing some short utterances together during testing?
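The splicing idea above can be sketched as follows. This is an assumption-laden illustration, not the repo's actual API: `frame_probs` is a dummy energy-based stand-in for real GPVAD inference, and the sample rate and 10 ms frame hop are assumed values. The point is only the bookkeeping: pad a short chunk with neighbouring audio so the bidirectional GRU sees a realistic utterance length, then keep only the frames that belong to the chunk.

```python
import numpy as np

SR = 16000   # assumed sample rate
HOP = 160    # assumed 10 ms hop between output frames

def frame_probs(wav):
    """Stand-in for GPVAD inference: one speech probability per frame.
    Replace with the real model; here a dummy energy-based score."""
    n_frames = len(wav) // HOP
    frames = wav[: n_frames * HOP].reshape(n_frames, HOP)
    energy = np.sqrt((frames ** 2).mean(axis=1))
    return energy / (energy.max() + 1e-8)

def vad_short_chunk(chunk, context):
    """Splice context audio around a short chunk, score the whole
    spliced signal, and return only the chunk's frame probabilities."""
    spliced = np.concatenate([context, chunk, context])
    probs = frame_probs(spliced)
    start = len(context) // HOP
    n_chunk = len(chunk) // HOP
    return probs[start : start + n_chunk]

chunk = np.random.randn(SR // 5)    # ~200 ms of audio
context = np.random.randn(2 * SR)   # 2 s of neighbouring audio
p = vad_short_chunk(chunk, context)
print(len(p))  # 20 frames for a 200 ms chunk at a 10 ms hop
```

With this scheme the model never sees an isolated 200 ms input at test time, while the caller still gets per-chunk probabilities back.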
Using your GPVAD/VADC, I wish to process smaller chunks (~200 ms) of audio files. However, when the duration is this low, the performance of the VAD is poor. What can I do to improve the performance? I assume this must be addressed on the training side. Would you recommend downloading the datasets, splicing them into these smaller chunks, and retraining from scratch?
Curious to hear your thoughts. Thank you!