bepierre / SpeechVGG

Feature extractor for DL speech processing.
GNU General Public License v3.0

How to use for speech inpainting? #5

shamoons closed this issue 3 years ago

MKegler commented 3 years ago

Hi, you can find details in the original paper: https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1532.pdf

First, the network on its own cannot do speech inpainting (or at least we didn't try it), so the inpainting itself requires another 'main' network. In the above paper, we used a convolutional U-Net.

The main idea is to use a pre-trained speechVGG as a speech feature extractor while training the 'main' network. In particular, pass both the segments reconstructed by the inpainting network and the corresponding target segments through speechVGG to obtain their representations (similarly to the speech/music/noise classification example). The L1 loss for training the main framework is then computed between these representations, rather than directly between the signals in the time-frequency domain. Specifically, in the inpainting paper we use the activations at the pooling layers and refer to the resulting loss as the 'deep feature loss'.
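For illustration, here is a minimal PyTorch-style sketch of such a deep feature loss. It is not the repository's actual API: `feature_extractor` is a hypothetical wrapper around a frozen, pre-trained speechVGG that returns the list of activations at its pooling layers for an input spectrogram.

```python
import torch
import torch.nn as nn


class DeepFeatureLoss(nn.Module):
    """L1 loss between speechVGG pooling-layer activations of the
    inpainted and target spectrograms, instead of an L1 loss computed
    directly in the time-frequency domain."""

    def __init__(self, feature_extractor):
        super().__init__()
        # Assumed interface: feature_extractor(spectrogram) returns a
        # list of tensors, one per pooling layer. speechVGG is
        # pre-trained and stays frozen; it only supplies the
        # representations used by the loss.
        self.feature_extractor = feature_extractor
        for p in self.feature_extractor.parameters():
            p.requires_grad = False

    def forward(self, inpainted, target):
        # Gradients must flow through the (frozen) extractor back to
        # the inpainting network's output.
        pred_feats = self.feature_extractor(inpainted)
        with torch.no_grad():
            target_feats = self.feature_extractor(target)
        # Sum the L1 distances over all pooling-layer activations.
        return sum(
            nn.functional.l1_loss(p, t)
            for p, t in zip(pred_feats, target_feats)
        )


# Usage during training of the 'main' network (names hypothetical):
#   criterion = DeepFeatureLoss(speech_vgg_features)
#   loss = criterion(unet(masked_spectrogram), clean_spectrogram)
#   loss.backward()
```

During training, only the main network's weights are updated; the frozen speechVGG simply maps both reconstructions and targets into the same feature space where the loss is measured.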

Hope it helps!