descriptinc / melgan-neurips

GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis
MIT License

Why perform Audio2Mel's method on extracting mel spectrogram? #36

Open shawnbzhang opened 3 years ago

shawnbzhang commented 3 years ago

Audio2Mel does the following to extract the mel spectrogram:

    # load and normalize here come from librosa (librosa.core.load, librosa.util.normalize)
    data, sampling_rate = load(full_path, sr=self.sampling_rate)
    data = 0.95 * normalize(data)

    if self.augment:
        amplitude = np.random.uniform(low=0.3, high=1.0)
        data = data * amplitude

    return torch.from_numpy(data).float(), sampling_rate

which is forwarded as:

    def forward(self, audio):
        p = (self.n_fft - self.hop_length) // 2
        audio = F.pad(audio, (p, p), "reflect").squeeze(1)
        fft = torch.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self.window,
            center=False,
        )
        real_part, imag_part = fft.unbind(-1)
        magnitude = torch.sqrt(real_part ** 2 + imag_part ** 2)
        mel_output = torch.matmul(self.mel_basis, magnitude)
        log_mel_spec = torch.log10(torch.clamp(mel_output, min=1e-5))
        return log_mel_spec
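If I'm reading the padding right, the `p = (n_fft - hop_length) // 2` reflect pad together with `center=False` yields exactly `samples // hop_length` frames, so the mel length stays a fixed multiple of the audio length. A quick check (the hyperparameters here are just illustrative, not necessarily the repo's):

```python
import torch
import torch.nn.functional as F

# Illustrative hyperparameters, not necessarily the repo's defaults
n_fft, hop_length, win_length = 1024, 256, 1024
audio = torch.randn(1, 1, 8192)  # (batch, channels, samples)

p = (n_fft - hop_length) // 2
padded = F.pad(audio, (p, p), "reflect").squeeze(1)
spec = torch.stft(
    padded,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
    window=torch.hann_window(win_length),
    center=False,
    return_complex=True,  # newer PyTorch versions require this flag
)
print(spec.shape)  # torch.Size([1, 513, 32]); 32 == 8192 // 256
```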

Is there a benefit to doing this over torchaudio's built-in mel spectrogram transform, e.g.:

    data, sampling_rate = torchaudio.load(full_path)
    melspec_ops = torchaudio.transforms.MelSpectrogram(
        sample_rate=sampling_rate,
        n_fft=self.n_fft,
        win_length=self.win_length,
        hop_length=self.hop_length,
        f_min=0,
        f_max=None,
        n_mels=self.n_mel_channels,
    )

    mel_spec = melspec_ops(data)

    log_mel_spec = torch.log10(mel_spec + 1e-9)
    return log_mel_spec

I'm just curious about this design choice, since it wasn't really addressed in the paper.

Side question: why do you multiply the normalized waveform by 0.95 in the original method?

J0shuaFernandes commented 2 years ago

Can you share the full Audio2Mel code?