Hi, Griffin-Lim is an iterative algorithm that creates an artificial phase spectrum derived from the magnitude spectrum. The MagPhase vocoder encodes both the magnitude and phase spectra of speech. So, during reconstruction, it uses the phase spectrum extracted from natural speech (or predicted, in the case of TTS), which results in a more natural sound quality.
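To make the contrast concrete, here is a minimal sketch using librosa (this is not MagPhase code, just an illustration of the two reconstruction strategies):

```python
import numpy as np
import librosa

# Load any waveform (librosa's bundled example is used here for convenience).
y, sr = librosa.load(librosa.example("trumpet"))
S = librosa.stft(y)  # complex STFT: magnitude and phase

# Griffin-Lim: discard the phase and iteratively estimate an artificial
# one from the magnitude spectrogram alone.
y_griffinlim = librosa.griffinlim(np.abs(S), n_iter=32)

# MagPhase-style idea: keep the true (or predicted) phase, so a plain
# inverse STFT recovers a more natural-sounding signal.
y_true_phase = librosa.istft(S)
```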
For your second question: As you mentioned, MagPhase encodes the magnitude and phase, so ideally you could use just a simple IFFT for reconstruction. In fact, that is what it does for lossless decoding (see demo_copy_synthesis_lossless.py). However, for acoustic modelling, the parameters are smoothed by the model (e.g., a DNN), which fails to capture the aperiodicities ("randomness") in speech. So, you need to recreate that "randomness" in certain parts of the signal, and for that MagPhase uses white noise, which is filtered and mixed with the components predicted by the acoustic model.
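A simplified, hypothetical sketch of both paths (the actual MagPhase filtering is more involved; the gain and filter coefficients below are placeholders, not MagPhase's real parameters):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)

# Lossless path: with the exact magnitude and phase, the complex spectrum
# is fully determined, so an inverse FFT per frame recovers the original
# signal (up to windowing / overlap-add).
def frame_from_mag_phase(mag, phase):
    return np.fft.irfft(mag * np.exp(1j * phase))

# TTS path: the model's smoothed output loses the natural "randomness",
# so spectrally shaped white noise is mixed back into the aperiodic parts.
# `gain`, `b`, `a` stand in for quantities the analysis / acoustic model
# would provide.
def add_aperiodicity(frame, gain, b, a):
    shaped_noise = lfilter(b, a, rng.standard_normal(len(frame)))
    return frame + gain * shaped_noise
```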
Thank you for your reply!
Sorry, but I'm not familiar with signal processing. I want to ask: now that we use amplitude and phase for acoustic modelling, why don't we just use the inverse STFT to reconstruct the waveform?