Can we use this model to do VAD?

YLQY commented 3 years ago

Hello, I recently came across this model while working on VAD. At the moment, we run our VAD (a neural network with an LSTM) on the vocal audio separated by this model. I have an idea: since the model can separate voice from background, it should know whether a given frame is voice or background, so we could build a VAD directly from it. So far I have tried using the mask matrix, but the results are not very good. When visualized, the mask looks similar to the separated vocal spectrogram. I don't want to give up on the idea, though. What should I modify?

Thanks

romi1502 commented 3 years ago

Hi @YLQY, is VAD Voice Activity Detection? If so, sure, you can use Spleeter for performing VAD. The easiest way would be to apply a learned threshold to a smoothed energy function of the separated vocal track. I think even such a basic setup should perform fairly well. If you want a more elaborate system (and possibly better performance), you can indeed try to classify the output spectrogram frames of Spleeter. I'm more skeptical about the mask, as it discards any energy information, but it could still work. However, I would consider keeping both the vocal and accompaniment outputs: Spleeter is not perfect and may fail to separate some particular vocals correctly, so any system working only on the separated vocals output won't be able to correctly predict that part of the vocal activity (which would have remained in the unobserved accompaniment track).
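
For illustration, here is a minimal sketch of the "threshold on a smoothed energy function" idea. It is not code from Spleeter itself. It assumes the separated vocal track is already available as a NumPy array of shape (n_samples,) or (n_samples, n_channels). The function names, the frame and hop sizes, the smoothing width, and the -40 dB threshold are all hypothetical placeholders; in practice the threshold would be learned or tuned on labeled data.

```python
import numpy as np


def frame_rms_db(waveform, frame_length=2048, hop_length=512, eps=1e-10):
    """Per-frame RMS energy in dB of a mono or multi-channel waveform."""
    x = np.asarray(waveform, dtype=np.float64)
    if x.ndim == 2:
        x = x.mean(axis=1)  # downmix (n_samples, n_channels) to mono
    if len(x) < frame_length:
        x = np.pad(x, (0, frame_length - len(x)))  # zero-pad very short inputs
    n_frames = 1 + (len(x) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = x[start:start + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    # eps keeps the log finite on digital silence
    return 20.0 * np.log10(rms + eps)


def smooth(values, width=9):
    """Moving average over `width` frames (width assumed odd)."""
    pad = width // 2
    padded = np.pad(values, pad, mode="edge")  # repeat edge values
    kernel = np.ones(width) / width
    return np.convolve(padded, kernel, mode="valid")


def energy_vad(vocals, threshold_db=-40.0, width=9, **frame_kwargs):
    """Boolean array, one entry per frame: True where voice is assumed active.

    threshold_db is a placeholder; it should be tuned or learned on real data.
    """
    energy_db = frame_rms_db(vocals, **frame_kwargs)
    return smooth(energy_db, width=width) > threshold_db
```

The same `frame_rms_db` function could also be applied to the accompaniment output, so that a classifier sees both energy curves rather than the vocal one alone.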

YLQY commented 3 years ago

Thank you. I'll try it now.

deezer / spleeter

Can we use this model to do VAD? #648