descriptinc / melgan-neurips

GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis
MIT License
980 stars 214 forks

How to normalize mel spectrogram extracted by Audio2Wav class? #45

Open predawnang opened 1 year ago

predawnang commented 1 year ago

Hi,

I want to normalize the mel spectrograms extracted by the Audio2Wav class to the range [-1, 1], but I am not sure how to do it. I found some code that appears to perform normalization and adapted it to Audio2Wav. I would appreciate any advice.

I modified the forward method of the Audio2Wav class, based on other people's code, hoping to achieve the normalization:

    def forward(self, audio):
        p = (self.n_fft - self.hop_length) // 2
        audio = F.pad(audio, (p, p), "reflect").squeeze(1)
        fft = torch.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self.window,
            center=False,
        )
        real_part, imag_part = fft.unbind(-1)
        magnitude = torch.sqrt(real_part ** 2 + imag_part ** 2)
        mel_output = torch.matmul(self.mel_basis, magnitude)
        log_mel_spec = torch.log10(torch.clamp(mel_output, min=1e-5))

        # The code I added: scale the log10 mel to dB, then shift and divide.
        ref_db, dc_db = 20, 100
        db_mel = 20 * log_mel_spec
        return (db_mel - ref_db + dc_db) / dc_db

I am not sure what the two lines I added actually do. Could somebody help me figure it out? For 20 * log_mel_spec, I guess it converts amplitude to the dB scale, but I am not sure whether multiplying log_mel_spec by 20 is correct. I assume (db_mel - ref_db + dc_db) / dc_db normalizes the mels, but I don't know the technical name of this operation.
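To make the question concrete, here is a minimal sketch of how I understand the two steps, written in plain Python instead of torch so the numbers are easy to check. The function names amp_to_db, normalize_db and to_symmetric are hypothetical and not part of the repository; ref_db = 20 and dc_db = 100 are the values from my snippet, and the 1e-5 floor matches the clamp in forward. The final rescaling to [-1, 1] is my own assumption about what I would still need, not something taken from the code I adapted.

```python
import math

REF_DB = 20.0   # value from the snippet above
DC_DB = 100.0   # value from the snippet above


def amp_to_db(amplitude, floor=1e-5):
    """Convert a magnitude (amplitude) value to decibels: 20 * log10(a)."""
    return 20.0 * math.log10(max(amplitude, floor))


def normalize_db(db, ref_db=REF_DB, dc_db=DC_DB):
    """Shift and scale so that [ref_db - dc_db, ref_db] maps onto [0, 1]."""
    return (db - ref_db + dc_db) / dc_db


def to_symmetric(x):
    """Assumed extra step: clip to [0, 1], then rescale to [-1, 1]."""
    clipped = min(max(x, 0.0), 1.0)
    return 2.0 * clipped - 1.0


# A made-up mel frame of magnitudes, only for illustration.
frame = [0.0, 1e-4, 1.0, 10.0]
db_frame = [amp_to_db(a) for a in frame]
unit_frame = [normalize_db(d) for d in db_frame]
sym_frame = [to_symmetric(u) for u in unit_frame]
```

If this reading is right, the added lines do not give [-1, 1] by themselves: the clamped floor of 1e-5 corresponds to -100 dB, which normalize_db maps to about -0.2, so the output can fall slightly below 0 and would need clipping and rescaling as in to_symmetric.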

Thank you very much