bpotard / idlak

This repository is now obsolete. Please go to https://github.com/idlak/idlak instead.

Is it possible to convert spectrogram to wav? #14

Open lifelongeek opened 6 years ago

lifelongeek commented 6 years ago

I have a spectrogram produced by compute-spectrogram-feats (from Kaldi), which is a linear magnitude spectrogram.

Does idlak provide source code to convert this spectrogram back to a raw wav?

I tried using librosa in Python, but it seems that librosa and Kaldi use different STFT algorithms. (https://stackoverflow.com/questions/43241612/spectrograms-generated-using-librosa-dont-look-consistent-with-kaldi)

bpotard commented 6 years ago

Hello,

It is not possible to convert a magnitude spectrogram directly back to a raw wav - you have lost the phase and all voicing information, which are very important if you want something that sounds good. With only the information in the spectrogram, you could rebuild whispered speech at best (although some people manage to rebuild a relatively decent phase using a DNN specifically trained on the original data - cf. Baidu's paper, section 4.2.2). To rebuild a wav, you could convert your magnitude spectrum into an FFT with zero phase, rebuild an excitation from the pitch extracted from the original signal, then convolve that excitation with your FFT - and it probably would not sound great.
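A minimal sketch of the zero-phase part of this idea, assuming NumPy and a (frames x bins) magnitude matrix; the excitation-convolution step is omitted, so this alone will sound buzzy/whispery:

```python
import numpy as np

def zero_phase_istft(mag, hop=160, n_fft=512):
    """Naive reconstruction: treat each magnitude frame as a zero-phase
    spectrum, inverse-FFT it, and overlap-add. All phase information is
    discarded, so this is only the skeleton of the approach above."""
    n_frames, n_bins = mag.shape          # n_bins == n_fft // 2 + 1
    out = np.zeros(n_fft + hop * (n_frames - 1))
    window = np.hanning(n_fft)
    for t in range(n_frames):
        frame = np.fft.irfft(mag[t], n=n_fft)    # zero phase -> real, symmetric
        frame = np.fft.fftshift(frame) * window  # center the impulse-like response
        out[t * hop : t * hop + n_fft] += frame
    return out
```

The hop and FFT sizes here are placeholders; they must match whatever analysis produced the spectrogram.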

Another (more correct?) option would be to extract the full FFT (not just the magnitude, but also the phase) and model both. But to have a chance of being able to model the phase components, you would need to extract the FFT pitch-synchronously, and then resample to a fixed frame rate. In other words, you would need to find the GCIs (glottal closure instants) in the original wav (for example using reaper), and center your FFT windows on those. Once it is modeled, you could resample your full FFT back to pitch-synchronous frames, and then recover a decent enough raw wav by IFFT and OLA.
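The windowing step of that pipeline might look like the sketch below (NumPy only; `gci_samples` are hypothetical GCI positions in samples, e.g. obtained externally with reaper):

```python
import numpy as np

def pitch_synchronous_fft(wav, gci_samples, n_fft=512):
    """Center an analysis window on each glottal closure instant and
    take the full complex FFT there, keeping magnitude AND phase."""
    window = np.hanning(n_fft)
    half = n_fft // 2
    frames = []
    for c in gci_samples:
        start = c - half
        seg = np.zeros(n_fft)
        lo, hi = max(start, 0), min(start + n_fft, len(wav))
        seg[lo - start : hi - start] = wav[lo:hi]  # zero-pad at signal edges
        frames.append(np.fft.rfft(seg * window))
    return np.array(frames)  # complex, one row per GCI
```

Resampling these variable-rate frames to a fixed frame rate (and back) is left out; this only shows the pitch-synchronous analysis itself.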

pineking commented 6 years ago

@bpotard what about this one https://github.com/librosa/librosa/issues/434 ? Audio can be synthesized from a linear magnitude spectrogram (without phase information) using the Griffin-Lim algorithm, with the phase estimated from a random initialization.
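For reference, the algorithm itself is short enough to sketch from scratch (NumPy only, with toy STFT/iSTFT helpers; real uses would rely on librosa or lws, and the FFT/hop sizes here are arbitrary):

```python
import numpy as np

def stft(y, n_fft=512, hop=128):
    w = np.hanning(n_fft)
    return np.array([np.fft.rfft(w * y[i:i + n_fft])
                     for i in range(0, len(y) - n_fft + 1, hop)])

def istft(S, n_fft=512, hop=128):
    w = np.hanning(n_fft)
    out = np.zeros(n_fft + hop * (len(S) - 1))
    norm = np.zeros_like(out)
    for t, frame in enumerate(S):
        out[t * hop : t * hop + n_fft] += w * np.fft.irfft(frame, n=n_fft)
        norm[t * hop : t * hop + n_fft] += w * w
    return out / np.maximum(norm, 1e-8)   # windowed overlap-add

def griffin_lim(mag, n_iter=50, n_fft=512, hop=128):
    """Start from a random phase, then alternately resynthesize and
    re-analyze, each time keeping the estimated phase but clamping the
    magnitude back to the known one."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        y = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(y, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)
```

As discussed above, this only recovers something speech-like when the input is a full linear magnitude spectrogram that still contains the harmonics.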

bpotard commented 6 years ago

This is rebuilding phase on a full magnitude spectrum (not a mel/log spectrogram), which has all the harmonics, and therefore the pitch information. Not sure how well it works for speech. Have you tried it?

lifelongeek commented 6 years ago

I extracted a linear magnitude spectrogram and applied Griffin-Lim with librosa, and it sounds good. This scheme is also used in the recent TTS paper Tacotron (https://arxiv.org/abs/1703.10135).

However, my concern is that the Short-Time Fourier Transforms of librosa and Kaldi are a bit different, so Griffin-Lim with librosa did not work on the spectrogram computed by Kaldi. I wonder whether IDLAK provides source code for Griffin-Lim reconstruction of Kaldi's spectrograms.

bpotard commented 6 years ago

I had a look, and compute-spectrogram-feats always applies a log to the magnitude spectrum, so you would need to apply exp to it before you can use it with lws or whatever implementation of Griffin-Lim you want. Neither Idlak nor Kaldi provides a Griffin-Lim implementation, but I will try to add an example with lws.
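The undo-the-log step might look like this. The shapes and frame parameters below are assumptions based on Kaldi's usual 16 kHz defaults (25 ms window zero-padded to a 512-point FFT, 10 ms shift, 257 bins per frame) and should be checked against your actual config:

```python
import numpy as np

# Stand-in for the real (frames x bins) matrix read from
# compute-spectrogram-feats, e.g. via kaldi_io.
logspec = np.random.randn(200, 257)

mag = np.exp(logspec)  # undo the log before handing the data to Griffin-Lim
# If the features turn out to be log *power* rather than log magnitude,
# take the square root as well:
#   mag = np.sqrt(np.exp(logspec))
# `mag` (frames x bins) can then be passed to lws, or transposed
# (bins x frames) for librosa.griffinlim.
```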

walidbou6 commented 2 years ago

I have trained a model to predict the spectrogram of a target speaker's speech, but the results I'm getting (arguably good) have unnatural artifacts in the synthetic speech.

Phase reconstruction from amplitude spectrograms based on directional-statistics deep neural networks

In speech processing, an amplitude spectrogram is often used for processing, and the corresponding phases are reconstructed from the amplitude spectrogram by using the Griffin-Lim method. However, the Griffin-Lim method causes unnatural artifacts in synthetic speech.