CSTR-Edinburgh / magphase

MagPhase Vocoder: Speech analysis/synthesis system for TTS and related applications.
Apache License 2.0

How does this differ from waveform reconstruction using the Griffin-Lim algorithm? #3

Closed npuichigo closed 6 years ago

npuichigo commented 6 years ago

Sorry, I'm not familiar with signal processing. Since we use both amplitude and phase for acoustic modelling, why don't we just use the inverse STFT to reconstruct the waveform?

felipeespic commented 6 years ago

Hi. Griffin-Lim is an iterative algorithm that creates an artificial phase spectrum derived from the magnitude spectrum. The MagPhase vocoder, by contrast, encodes both the magnitude and the phase spectra of speech. So, during reconstruction, it uses the phase spectrum extracted from natural speech (or predicted by the acoustic model, in the case of TTS), which results in a more natural sound quality.
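
To make the contrast concrete, here is a minimal NumPy sketch (not MagPhase code, and not a tuned Griffin-Lim implementation). It assumes a toy two-sinusoid signal, a periodic Hann window, and hypothetical helper names (`stft`, `istft`, `griffin_lim`, `mag_error`). It reconstructs the signal once from the magnitude plus the original phase, and once from the magnitude alone with an iteratively estimated phase.

```python
import numpy as np

N_FFT, HOP = 256, 64                    # illustrative analysis settings (assumed)
WIN = np.hanning(N_FFT + 1)[:-1]        # periodic Hann window

def stft(x):
    """Windowed frames -> complex spectra, shape (n_frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Weighted overlap-add (least-squares) inverse of stft()."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    out = np.zeros(N_FFT + HOP * (len(spec) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame * WIN
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter, seed=0):
    """Estimate a phase for `mag` by alternating projections from a random start."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase)                    # impose the target magnitude
        phase = np.exp(1j * np.angle(stft(x)))    # keep only the resulting phase
    return istft(mag * phase)

def mag_error(y, mag):
    """Distance between the STFT magnitude of y and the target magnitude."""
    return np.linalg.norm(np.abs(stft(y)) - mag)

# Toy signal standing in for speech (assumed 16 kHz sampling rate).
t = np.arange(N_FFT + HOP * 60) / 16000.0
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 1760 * t)
spec = stft(x)
mag = np.abs(spec)

x_true_phase = istft(spec)                     # magnitude + original phase
x_random_phase = griffin_lim(mag, n_iter=0)    # magnitude + random phase
x_griffin_lim = griffin_lim(mag, n_iter=30)    # magnitude + estimated phase

print("random-phase magnitude error:", mag_error(x_random_phase, mag))
print("Griffin-Lim magnitude error: ", mag_error(x_griffin_lim, mag))
```

With the original phase, the inverse transform returns the input signal away from the edges. Griffin-Lim only moves towards a signal whose magnitude spectrum matches the target, so its phase is artificial rather than taken from the recording.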

For your second question: as you mentioned, MagPhase encodes the magnitude and phase, so ideally you could just use a simple IFFT for reconstruction. In fact, that is what it does for lossless decoding (see demo_copy_synthesis_lossless.py). However, in acoustic modelling the parameters are smoothed by the model (e.g., a DNN), which does not capture the aperiodicities ("randomness") in speech. So you need to recreate that "randomness" in certain parts of the signal, and for that MagPhase uses white noise, which is filtered and mixed with the components predicted by the acoustic model.
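
As a rough illustration of that last step (a sketch under stated assumptions, not the actual MagPhase synthesis code), the snippet below builds one frame from a smoothed magnitude envelope. The function name `mix_frame`, the hard `cutoff_bin` split, and the toy envelope and phase are all hypothetical. Bins below the cutoff use the modelled phase, and bins above it are filled with white noise filtered by the same envelope.

```python
import numpy as np

N_FFT = 256  # illustrative frame length (assumed)

def mix_frame(mag, phase, cutoff_bin, rng):
    """Return one time-domain frame: modelled phase below cutoff_bin, filtered noise above."""
    noise_spec = np.fft.rfft(rng.standard_normal(N_FFT))
    # Filter the white noise with the magnitude envelope. Dividing by sqrt(N_FFT)
    # makes the expected noise power in each bin match mag ** 2.
    shaped_noise = noise_spec * mag / np.sqrt(N_FFT)
    spec = mag * np.exp(1j * phase)                  # deterministic component
    spec[cutoff_bin:] = shaped_noise[cutoff_bin:]    # stochastic component
    return np.fft.irfft(spec, n=N_FFT)

# Hypothetical smoothed parameters, as an acoustic model might predict them.
n_bins = N_FFT // 2 + 1
mag = np.exp(-np.arange(n_bins) / 40.0)
phase = np.linspace(0.0, -8.0 * np.pi, n_bins)
cutoff_bin = 40

frame_a = mix_frame(mag, phase, cutoff_bin, np.random.default_rng(1))
frame_b = mix_frame(mag, phase, cutoff_bin, np.random.default_rng(2))
spec_a, spec_b = np.fft.rfft(frame_a), np.fft.rfft(frame_b)
```

Two frames built from the same parameters share the deterministic part and differ only in the noise region, which is the "randomness" that the smoothed parameters alone cannot provide.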

npuichigo commented 6 years ago

Thank you for your reply!