
General thoughts #13

Open GabrieleMazzola opened 5 years ago

GabrieleMazzola commented 5 years ago

In this project, they provide an example of how to feed mel spectrograms to a DNN. However, we need to deal with the fact that our inputs can have different lengths (we need a sequence-to-sequence architecture, I guess). In their project, they clearly state that they sidestep this issue by using a dataset in which all audio clips have the same length, but that is not feasible for us.

How to compute Mel spectrograms in Python
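For reference, a minimal sketch of computing a log-mel spectrogram with librosa (the file name and the `n_fft` / `hop_length` / `n_mels` values are placeholder assumptions, not settled choices):

```python
import librosa
import numpy as np

# Load audio at its native sampling rate
y, sr = librosa.load("example.wav", sr=None)

# Mel power spectrogram: one 80-dim frame per hop (placeholder parameters)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Convert to log scale (dB), which is usually what gets fed to a DNN
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames) -- n_frames varies with audio length
```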

This is an interesting paper about sequence-to-sequence modeling using LSTMs.

Using the mel spectrogram, we have a fixed-size representation AT EACH TIMESTEP, with of course a different number of timesteps depending on the length of the audio source. The representation of each timestep is basically the set of coefficients extracted for the different frequency bands.

The general idea would be (TRAINING):

  1. read audio
  2. create mel spectrogram (original input)
  3. add noise to the audio
  4. create mel spectrogram (noisy input)
  5. train a DNN (LSTM?) to map the noisy mel spectrogram to the original mel spectrogram*

*for each training sample, the input and the output of the DNN have the same length; however, that length differs across samples.
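A rough sketch of how the (noisy, clean) training pairs could be built; white noise at a fixed SNR is just an assumption here, real noise recordings would be handled the same way:

```python
import librosa
import numpy as np

def make_training_pair(path, snr_db=10.0, n_fft=1024, hop_length=256, n_mels=80):
    """Return (noisy, clean) log-mel spectrograms for one audio file.

    snr_db and the spectrogram parameters are placeholder assumptions.
    """
    y, sr = librosa.load(path, sr=None)

    # Add white noise scaled to the requested signal-to-noise ratio
    noise = np.random.randn(len(y))
    signal_power = np.mean(y ** 2)
    noise_power = np.mean(noise ** 2)
    noise *= np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    y_noisy = y + noise

    def log_mel(x):
        m = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                           hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(m, ref=np.max)

    # Input and target have the same number of frames for a given file,
    # but that number differs across files (hence the seq2seq concern above)
    return log_mel(y_noisy), log_mel(y)
```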

All of this is based on the fact that we should be able to synthesize the audio back from the mel spectrogram. We still have to verify this, but it looks possible.
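On the "synthesize back" question: librosa ships an approximate mel inversion based on Griffin-Lim (it only recovers magnitudes, so phase is estimated and quality is limited). A quick check could look like this, with parameter values assumed to match the forward transform:

```python
import librosa
import soundfile as sf  # assumption: using soundfile to write the result

y, sr = librosa.load("example.wav", sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Approximate inversion: pseudo-inverse of the mel filterbank + Griffin-Lim phase estimation
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)

sf.write("reconstructed.wav", y_hat, sr)
```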

cseale commented 5 years ago

Below I detail some relevant papers in the area of speech denoising. The papers are listed in the recommended reading order.

Good Starters

All of the above methods propose using the STFT (short-time Fourier transform) as the representation. With this representation, we can convert the output back to audio by using the inverse STFT.
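A minimal check that the STFT representation is invertible (librosa here; the parameter values are arbitrary assumptions):

```python
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=None)

# Complex STFT: a denoising model would typically operate on np.abs(D)
D = librosa.stft(y, n_fft=1024, hop_length=256)

# Inverse STFT; reusing the original phase gives near-perfect reconstruction.
# A real denoiser only predicts magnitudes, so the noisy phase is usually reused.
y_rec = librosa.istft(D, hop_length=256, length=len(y))

print(np.max(np.abs(y - y_rec)))  # should be tiny
```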

A Bit More Advanced

Not Read Yet

Other Possible Goldmines/Traps

Shahla Parveen and Phil Green. Speech enhancement with missing data techniques using recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages I–733, 2004.

Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23(1):7–19, 2015.

Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florencio, and Mark Hasegawa-Johnson. Speech enhancement using Bayesian WaveNet. In Interspeech, 2017.

cseale commented 5 years ago

So here is my idea for the current approach:

  1. Take some audio dataset and augment it with noise
  2. Compute the STFT for each sample
  3. Train 3 different autoencoder architectures (a minimal sketch of i. follows this list):
     i. Regular autoencoder (see Experiments on Deep Learning for Speech Denoising)
     ii. RNN (see Recurrent Neural Networks for Noise Reduction in Robust ASR)
     iii. CNN (still need an example paper)
  4. Convert back to audio using the inverse STFT
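For point 3.i, a PyTorch sketch of a fully connected autoencoder operating frame by frame on STFT magnitudes; the layer sizes and the direct noisy-to-clean mapping are assumptions, roughly in the spirit of the paper rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

N_FREQ = 513  # n_fft // 2 + 1 for n_fft = 1024 (assumed)

# Simple frame-wise autoencoder: noisy magnitude frame -> clean magnitude frame
model = nn.Sequential(
    nn.Linear(N_FREQ, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, N_FREQ),
    nn.ReLU(),  # magnitudes are non-negative
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(noisy_mag, clean_mag):
    """One gradient step on a batch of (frames, N_FREQ) magnitude spectra."""
    optimizer.zero_grad()
    loss = loss_fn(model(noisy_mag), clean_mag)
    loss.backward()
    optimizer.step()
    return loss.item()
```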
GabrieleMazzola commented 5 years ago

Do we have a running example of audio obtained by using the inverse STFT? (point 4)

GabrieleMazzola commented 5 years ago

https://www.kvraudio.com/forum/viewtopic.php?t=469887

rpytel1 commented 5 years ago

As we were discussing next steps last time, I was thinking about using some pre-trained models or finding some inspiration on GitHub. Here is what I found:

Simpler Approaches

Both projects are implemented in TensorFlow; however, we can extract the concepts and implement them in PyTorch.
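As a sketch of what such a PyTorch port could look like for the RNN route, here is a mask-based LSTM denoiser; the mask output and the layer sizes are assumptions, not taken from either project:

```python
import torch
import torch.nn as nn

class LSTMDenoiser(nn.Module):
    """Maps a sequence of noisy magnitude frames to a [0, 1] mask per frame."""

    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):        # (batch, time, n_freq)
        h, _ = self.lstm(noisy_mag)
        mask = torch.sigmoid(self.out(h))
        return mask * noisy_mag          # estimated clean magnitudes

# Variable-length inputs are fine: the LSTM simply runs for however many frames there are.
```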

More Advanced Approach

bianca26 commented 5 years ago

Stanford paper working on a project very similar to ours:

rpytel1 commented 5 years ago

Current state:

Further steps: