NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

Converting audio samples to mel then back to audio just generates noise. #262

Closed: jamescasia closed this issue 2 years ago

jamescasia commented 2 years ago

I have a small piece of code in which I load a librosa sample audio clip, convert it to a mel spectrogram using librosa.feature.melspectrogram with the parameters detailed in config.json, and then convert it back to audio with WaveGlow. But all I get is high-frequency noise.


import librosa
import numpy as np
import torch
from IPython.display import Audio
from scipy.io.wavfile import write

# y and rate come from loading the sample with librosa;
# waveglow is the pretrained model (setup not shown).
mel = librosa.feature.melspectrogram(y=y, sr=22050, n_mels=80, hop_length=256,
                                     n_fft=1024, win_length=1024)
mel = librosa.power_to_db(mel, ref=np.max)

with torch.no_grad():
    audio = waveglow.infer(torch.tensor(np.expand_dims(mel, axis=0)).to('cuda'))

audio_numpy = audio[0].cpu().numpy()

write("audio.wav", rate, audio_numpy)
Audio(audio_numpy, rate=22050)

w00zie commented 2 years ago

Hey @jamescasia, I noticed that you just closed the issue. Did you find a solution?

jamescasia commented 2 years ago

Yes, I was able to convert audio to Tacotron's mel spectrogram and back to audio. I used TacotronSTFT's mel_spectrogram function to convert the audio into mels properly, then used WaveGlow to convert it back, although the quality declined a little. Here's how I did it.


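import sys
import torch

sys.path.insert(0, 'tacotron2')  # assumes the tacotron2 submodule of the WaveGlow repo
from layers import TacotronSTFT

def stft_to_tmel(stft):
    # Despite its name, `stft` is a 1-D waveform tensor with values in [-1, 1].
    stft = stft.unsqueeze(0)  # add a batch dimension: (T,) -> (1, T)
    stft = torch.autograd.Variable(stft, requires_grad=False)

    mel = TacotronSTFT(filter_length=1024,
                       hop_length=256,
                       win_length=1024,
                       sampling_rate=22050,
                       mel_fmin=0.0, mel_fmax=8000.0).mel_spectrogram(stft)
    mel = torch.squeeze(mel, 0)  # drop the batch dimension again
    return mel
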
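
A likely reason the librosa version produced noise, offered as an illustration rather than a confirmed diagnosis: Tacotron 2's mel_spectrogram compresses a magnitude mel spectrogram with a natural log of values clamped at 1e-5, while librosa.power_to_db with ref=np.max returns decibels of a power spectrogram relative to its peak. The pure-Python sketch below compares the two scalings on a few made-up magnitude values. The helper names and the sample values are hypothetical and only serve the comparison.

```python
import math


def tacotron_log_compress(magnitude, clip_val=1e-5):
    """Natural log of a clamped magnitude (Tacotron 2 style compression)."""
    return math.log(max(magnitude, clip_val))


def db_relative_to_max(powers, amin=1e-10, top_db=80.0):
    """10*log10(power / peak power), floored at top_db below the peak.

    Mirrors what power_to_db(ref=np.max) computes with its default
    amin and top_db, written out for plain Python lists.
    """
    ref = max(powers)
    db = [10.0 * math.log10(max(p, amin)) - 10.0 * math.log10(max(ref, amin))
          for p in powers]
    floor = max(db) - top_db
    return [max(d, floor) for d in db]


# Made-up mel magnitudes, from loud to silent (illustrative values only).
magnitudes = [1.0, 0.1, 0.001, 0.0]

log_mel = [tacotron_log_compress(m) for m in magnitudes]
db_mel = db_relative_to_max([m ** 2 for m in magnitudes])  # power = magnitude^2

for m, a, b in zip(magnitudes, log_mel, db_mel):
    print(f"magnitude={m:<6} log-mel={a:8.3f}  dB-mel={b:8.3f}")
```

For the same input, the dB values span 0 down to -80, while the log values run from 0 down to about -11.5. A vocoder expecting the second range would see the first as far outside its usual input range. The fmax setting also differs (8000.0 here versus librosa's default of half the sample rate).
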
w00zie commented 2 years ago

Thank you!