Firstly, the network on its own cannot do speech inpainting (or at least we didn't try it). So for inpainting you need another 'main' network; in the above paper we used a convolutional U-Net.
The main idea is to use a pre-trained speechVGG as a speech feature extractor when training the 'main' network. In particular, you use it to process speech segments (both the segments reconstructed by the inpainting network and the targets during training) to obtain their representations, similarly to the speech/music/noise classification example. The L1 loss for training the main framework is then computed between the representations obtained through speechVGG, rather than directly in the time-frequency domain. Specifically, in the inpainting paper we use the activations at the pooling layers and refer to the resulting loss as the 'deep feature loss'.
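To make the idea concrete, here is a minimal numpy sketch of a deep feature loss. The "extractor" below is a toy stand-in for a frozen, pre-trained speechVGG (just two fixed random conv+ReLU+max-pool stages, not the actual architecture or weights); the point is only to show how L1 is taken between pooling-layer activations instead of between spectrograms directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    # 'valid' 2-D correlation, single channel, followed by ReLU
    # (illustration only; real speechVGG has multi-channel conv stacks)
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return np.maximum(out, 0.0)

def max_pool(x):
    # 2x2 max pooling, dropping any odd remainder
    H2, W2 = x.shape[0] // 2, x.shape[1] // 2
    return x[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2).max(axis=(1, 3))

# Frozen "pre-trained" weights: random here, purely illustrative
kernels = [rng.standard_normal((3, 3)) for _ in range(2)]

def deep_features(spec):
    """Collect activations after each pooling layer."""
    feats, x = [], spec
    for k in kernels:
        x = max_pool(conv2d(x, k))
        feats.append(x)
    return feats

def deep_feature_loss(reconstructed, target):
    """L1 between pooling-layer activations, averaged per layer and summed."""
    return sum(
        np.mean(np.abs(fr - ft))
        for fr, ft in zip(deep_features(reconstructed), deep_features(target))
    )

# Toy log-magnitude spectrograms standing in for target and inpainted output
target = rng.standard_normal((32, 32))
reconstructed = target + 0.1 * rng.standard_normal((32, 32))

print(deep_feature_loss(reconstructed, target))  # small positive value
print(deep_feature_loss(target, target))         # exactly 0.0
```

During actual training, the extractor's weights stay frozen and the gradient of this loss flows back through it into the inpainting U-Net.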
Hi, you can find details in the original paper: https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1532.pdf
Hope it helps!