
General thoughts #13

Open GabrieleMazzola opened 5 years ago

GabrieleMazzola commented 5 years ago

In this project, they provide an example of how to feed mel spectrograms to a DNN. However, we need to deal with the fact that our inputs can have different lengths (we need a sequence-to-sequence architecture, I guess). In their project, they clearly state that they sidestep this issue by using a dataset in which all audio clips have the same length, but that is not feasible for us.

How to compute Mel spectrograms in Python
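For reference, a minimal sketch of computing a log-mel spectrogram with librosa (the file name and the `n_fft` / `hop_length` / `n_mels` values are placeholder assumptions, not settled choices):

```python
import librosa
import numpy as np

# Load audio at its native sampling rate
y, sr = librosa.load("example.wav", sr=None)

# Mel power spectrogram: one 80-dim frame per hop (placeholder parameters)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Convert to log scale (dB), which is usually what gets fed to a DNN
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames) -- n_frames varies with audio length
```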

This is an interesting paper about sequence-to-sequence modeling using LSTMs.

Using the mel spectrogram, we have a fixed-size representation AT EACH TIMESTEP, with of course a different number of timesteps depending on the length of the audio source. The representation of each timestep is basically the set of coefficients extracted for the different frequency bands.

The general idea would be (TRAINING):

  1. read audio
  2. create mel spectrogram (original input)
  3. add noise to the audio
  4. create mel spectrogram (noisy input)
  5. train a DNN (LSTM?) to map the noisy mel spectrogram to the original mel spectrogram*

*for each training sample, the input and the output of the DNN have the same length; however, that length differs across samples.
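A rough sketch of how the (noisy, clean) training pairs could be built; white noise at a fixed SNR is just an assumption here, real noise recordings would be handled the same way:

```python
import librosa
import numpy as np

def make_training_pair(path, snr_db=10.0, n_fft=1024, hop_length=256, n_mels=80):
    """Return (noisy, clean) log-mel spectrograms for one audio file.

    snr_db and the spectrogram parameters are placeholder assumptions.
    """
    y, sr = librosa.load(path, sr=None)

    # Add white noise scaled to the requested signal-to-noise ratio
    noise = np.random.randn(len(y))
    signal_power = np.mean(y ** 2)
    noise_power = np.mean(noise ** 2)
    noise *= np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    y_noisy = y + noise

    def log_mel(x):
        m = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                           hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(m, ref=np.max)

    # Input and target have the same number of frames for a given file,
    # but that number differs across files (hence the seq2seq concern above)
    return log_mel(y_noisy), log_mel(y)
```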

All of this is based on the fact that we should be able to synthesize the audio back from the mel spectrogram. We still have to verify this, but it looks possible.
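On the "synthesize back" question: librosa ships an approximate mel inversion based on Griffin-Lim (it only recovers magnitudes, so phase is estimated and quality is limited). A quick check could look like this, with parameter values assumed to match the forward transform:

```python
import librosa
import soundfile as sf  # assumption: using soundfile to write the result

y, sr = librosa.load("example.wav", sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Approximate inversion: pseudo-inverse of the mel filterbank + Griffin-Lim phase estimation
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)

sf.write("reconstructed.wav", y_hat, sr)
```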

cseale commented 5 years ago

Below I detail some relevant papers in the area of speech denoising. The papers are listed in the recommended reading order.

Good Starters

All of the above methods propose using the STFT (short-time Fourier transform) as the representation. With this representation, we can convert the output back to audio by using the inverse STFT.
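A minimal check that the STFT representation is invertible (librosa here; the parameter values are arbitrary assumptions):

```python
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=None)

# Complex STFT: a denoising model would typically operate on np.abs(D)
D = librosa.stft(y, n_fft=1024, hop_length=256)

# Inverse STFT; reusing the original phase gives near-perfect reconstruction.
# A real denoiser only predicts magnitudes, so the noisy phase is usually reused.
y_rec = librosa.istft(D, hop_length=256, length=len(y))

print(np.max(np.abs(y - y_rec)))  # should be tiny
```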

A Bit More Advanced

Not Read Yet

Other Possible Goldmines/Traps

Shahla Parveen and Phil Green. Speech enhancement with missing data techniques using recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages I–733, 2004.

Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23(1):7–19, 2015.

Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florencio, and Mark Hasegawa-Johnson. Speech enhancement using Bayesian WaveNet. In Interspeech, 2017.

cseale commented 5 years ago

So here is my idea for the current approach:

  1. Take some audio dataset and augment it with noise
  2. Compute the STFT for each sample
  3. Train 3 different autoencoder architectures (a minimal sketch of i. follows this list):
     i. Regular autoencoder (see Experiments on Deep Learning for Speech Denoising)
     ii. RNN (see Recurrent Neural Networks for Noise Reduction in Robust ASR)
     iii. CNN (still need an example paper)
  4. Convert back to audio using the inverse STFT
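For point 3.i, a PyTorch sketch of a fully connected autoencoder operating frame by frame on STFT magnitudes; the layer sizes and the direct noisy-to-clean mapping are assumptions, roughly in the spirit of the paper rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

N_FREQ = 513  # n_fft // 2 + 1 for n_fft = 1024 (assumed)

# Simple frame-wise autoencoder: noisy magnitude frame -> clean magnitude frame
model = nn.Sequential(
    nn.Linear(N_FREQ, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, N_FREQ),
    nn.ReLU(),  # magnitudes are non-negative
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(noisy_mag, clean_mag):
    """One gradient step on a batch of (frames, N_FREQ) magnitude spectra."""
    optimizer.zero_grad()
    loss = loss_fn(model(noisy_mag), clean_mag)
    loss.backward()
    optimizer.step()
    return loss.item()
```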
GabrieleMazzola commented 5 years ago

Do we have a running example of audio obtained by using the inverse STFT? (point 4)

GabrieleMazzola commented 5 years ago

https://www.kvraudio.com/forum/viewtopic.php?t=469887

rpytel1 commented 5 years ago

As we were discussing next steps last time, I was thinking about using some pre-trained models or finding some inspiration on GitHub. Here is what I found:

Simpler Approaches

Both projects are implemented in TensorFlow; however, we can extract the concepts and implement them in PyTorch.
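As a sketch of what such a PyTorch port could look like for the RNN route, here is a mask-based LSTM denoiser; the mask output and the layer sizes are assumptions, not taken from either project:

```python
import torch
import torch.nn as nn

class LSTMDenoiser(nn.Module):
    """Maps a sequence of noisy magnitude frames to a [0, 1] mask per frame."""

    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):        # (batch, time, n_freq)
        h, _ = self.lstm(noisy_mag)
        mask = torch.sigmoid(self.out(h))
        return mask * noisy_mag          # estimated clean magnitudes

# Variable-length inputs are fine: the LSTM simply runs for however many frames there are.
```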

More Advanced Approach

bianca26 commented 5 years ago

Stanford paper working on a project very similar to ours:

rpytel1 commented 5 years ago

Current state:

Further steps: