facebookresearch / denoiser

Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)

We provide a PyTorch implementation of the paper Real Time Speech Enhancement in the Waveform Domain, in which we present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly to the raw waveform, which further improve model performance and its generalization abilities.

Use output of lstm as embeddings for classification task #91


parashar-gaurav commented 3 years ago

Hello @adefossez and @adiyoss , thanks a lot for this wonderful and helpful repository.

I have been using this architecture for enhancing human voices, and I thought of using it for a different task at hand. I want to build a classifier that takes as input a sound clip of fixed length and tells whether adult human voices are present in it or not.

Would it be wise to use the output of the LSTM as embeddings and build a classifier on top of it?
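For concreteness, here is a minimal sketch of what I have in mind, using a forward hook to grab the LSTM output (I am assuming the `pretrained.dns64()` loader from the README and the `model.lstm` attribute from denoiser/demucs.py; the mean-pooling and the classifier head are just placeholder choices):

```python
import torch
import torch.nn as nn
from denoiser import pretrained

# Load the pretrained causal Demucs model (dns64 weights).
model = pretrained.dns64()
model.eval()

# Capture the LSTM output with a forward hook; in denoiser/demucs.py the
# bottleneck is exposed as `model.lstm` and its forward returns
# (output, hidden_state).
features = {}

def hook(module, inputs, output):
    features["lstm"] = output[0].detach()  # (time, batch, channels)

handle = model.lstm.register_forward_hook(hook)

with torch.no_grad():
    wav = torch.randn(1, 1, 16000)  # dummy 1 s of 16 kHz mono audio
    model(wav)
handle.remove()

# Pool over time to get a fixed-size clip embedding, then classify.
emb = features["lstm"].mean(dim=0)  # (batch, channels)
clf = nn.Sequential(  # hypothetical classifier head
    nn.Linear(emb.shape[-1], 128), nn.ReLU(), nn.Linear(128, 2))
logits = clf(emb)  # adult voice present / absent
```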

Also, the bigger task I am working on is this: given a clip with baby sounds in the foreground and adult speech in the background, I want to get rid of the talking sounds. I tried multiple approaches:

1) use the output of the pretrained denoiser and subtract it from the original wav file (see the sketch after this list). This didn't work, because (a) the reverberation of the human talking sounds was still present in the modified wav file, (b) the pretrained model was enhancing the human talking sounds in some parts and the baby sounds in others, and (c) this got more complicated when more than one speaker was talking in the background.

2) tried retraining the denoiser model from scratch, with the focus on enhancing the baby sounds, but it didn't work robustly: in a few files the human speech was left intact, and in a few others the speech was suppressed but a small devilish sound was left behind.

3) thought of posing this problem as a classification task, where the embeddings from the output of the LSTM could be used as feature vectors that are discriminative between baby sounds and normal human speech.
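To be explicit about approach 1), this is roughly what I tried (the `pretrained.dns64()` loader and the `convert_audio` helper are from this repo's README; `clip.wav` is a placeholder file name):

```python
import torch
import torchaudio
from denoiser import pretrained
from denoiser.dsp import convert_audio

model = pretrained.dns64()
model.eval()

# Load a clip and match the model's sample rate and channel count.
wav, sr = torchaudio.load("clip.wav")  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.chin)

with torch.no_grad():
    enhanced = model(wav.unsqueeze(0)).squeeze(0)

# The residual is everything the denoiser removed, i.e. the noise estimate.
residual = wav - enhanced
torchaudio.save("residual.wav", residual, model.sample_rate)
```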

It would be great if you could share your insights and intuitions on this approach.

Thanks! Gaurav

adefossez commented 3 years ago

Hello @parashar-gaurav, it is likely that some information can be extracted from the output of the LSTM. However, there might also be information in the lower-level U-Net skip connections, so it is not certain that all the relevant information would be in the LSTM output.
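If you want to compare both, you can capture those lower-level activations with forward hooks too. A quick sketch, assuming `model.encoder` is the `nn.ModuleList` defined in denoiser/demucs.py:

```python
import torch
from denoiser import pretrained

model = pretrained.dns64()
model.eval()

# Hook every encoder layer; their outputs are what feeds the U-Net
# skip connections in Demucs.forward.
acts = {}

def make_hook(name):
    def hook(module, inputs, output):
        acts[name] = output.detach()  # (batch, channels, time)
    return hook

handles = [layer.register_forward_hook(make_hook(f"enc{i}"))
           for i, layer in enumerate(model.encoder)]

with torch.no_grad():
    model(torch.randn(1, 1, 16000))  # dummy 1 s of 16 kHz audio

for name, a in acts.items():
    print(name, tuple(a.shape))  # deeper layers: fewer time steps, more channels

for h in handles:
    h.remove()
```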

I think the hard part with your task is that the denoiser was primarily trained to keep a single speaker who is clearly the main speaker (i.e. close to the mic and loud). Multiple speakers in the distance will be considered noise, but the thing is, baby babbling or crying will be considered noise as well and removed. So there is no way to reliably use the denoiser as-is. You could try to fine-tune it, hoping that some of the internal weights and learnt features would still be relevant, but I'm not sure how well that would work.
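If you do try fine-tuning, the simplest version is to keep training the pretrained weights on your own (mixture, target) pairs. A bare-bones sketch, not our actual train.py pipeline, with dummy tensors standing in for a real dataset of baby-plus-speech mixtures paired with baby-only targets:

```python
import torch
import torch.nn.functional as F
from denoiser import pretrained

model = pretrained.dns64()
model.train()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# Dummy batch: `noisy` would be baby + adult speech, `target` the baby only.
noisy = torch.randn(4, 1, 16000)
target = torch.randn(4, 1, 16000)

for step in range(10):
    estimate = model(noisy)             # output has the same length as input
    loss = F.l1_loss(estimate, target)  # plain L1; the paper adds an STFT loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(step, loss.item())
```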

For your 2), it all depends on the amount of training data you have.

adiyoss commented 3 years ago

Hi @parashar-gaurav, I agree with Alex. I think using the denoiser model to remove adult speech and keep baby sounds won't be a good approach here, since we trained it to remove exactly such noises (baby crying, etc.)

Regarding classifying whether an input clip contains human speech or not, using the LSTM features makes sense, and they are probably a good start :) Anyway, let us know how it works!