descriptinc / melgan-neurips

GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis
MIT License

about final loss? #4

MorganCZY opened this issue 4 years ago

MorganCZY commented 4 years ago

Could you post your training loss curves for the LJSpeech dataset? After 3k epochs, the waves synthesized by my trained model are of poor quality compared to the released model "linda_johnson.pt". I wonder what your final losses were and whether there are any other training tricks. Thanks in advance. (Here are some synthesized samples: samples.zip)

fatchord commented 4 years ago

@MorganCZY This model needs a lot of training steps. I've trained one for a million steps and it sounds great.

MorganCZY commented 4 years ago

But after 3k epochs that's already more than two million steps. Could you upload your TensorBoard graphs? I want to check whether my training losses are on the right track.
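For reference, the epoch-to-step arithmetic here checks out; a quick sketch (assuming LJSpeech's roughly 13,100 clips and a batch size of 16 -- both numbers are assumptions, so substitute your own config):

import math

num_clips = 13100                                    # LJSpeech-1.1 utterance count
batch_size = 16                                      # assumed; check your training config
steps_per_epoch = math.ceil(num_clips / batch_size)  # ~819
print(3000 * steps_per_epoch)                        # ~2.46 million steps after 3k epochs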

m-toman commented 4 years ago

I think it depends a lot on the quality of the mels from the taco (or whatever acoustic model). Did you use your own model?

MorganCZY commented 4 years ago

@m-toman I only tried to train a vocoder, not a whole TTS system, so ground-truth mel-spectrograms rather than the outputs of a taco are used to train this MelGAN.

hyzhan commented 4 years ago

What level of s_error yields understandable audio?
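Assuming s_error here refers to a spectral reconstruction error logged during training, i.e. the L1 distance between the mel of the real audio and the mel of the generated audio (that reading is a guess), it would be computed along these lines:

import torch
import torch.nn.functional as F

# Hypothetical names: netG is the trained generator, (mel, audio) a held-out pair,
# and Audio2Mel is the module posted later in this thread.
fft = Audio2Mel()
with torch.no_grad():
    pred_audio = netG(mel)
    s_error = F.l1_loss(fft(pred_audio), fft(audio)).item()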

JunjunCui commented 4 years ago

> But after 3k epochs that's already more than two million steps. Could you upload your TensorBoard graphs? I want to check whether my training losses are on the right track.

Hello, how long does one training step take for you? With batch size 16 on an RTX 2080, a single step takes me more than 3 seconds.

himajin2045 commented 4 years ago

I trained the model on the Chinese corpus SLR38 for 1.2 million steps, and the generated result for an unseen speaker (from the same corpus but not in the training set) sounds really good.

It's worth noting that the generated audio still had some background noise at 0.9 million steps.

With batch size 2 on a single RTX 2080, the training speed is 100 steps per 17 seconds (roughly 57 hours to reach 1.2 million steps).

I changed the mel generation function a little to match other open source implementations (e.g. https://github.com/fatchord/WaveRNN/blob/master/utils/dsp.py):

import torch
import torch.nn as nn
import torch.nn.functional as F
from librosa.filters import mel as librosa_mel_fn


class Audio2Mel(nn.Module):
    def __init__(
        self,
        n_fft=1024,
        hop_length=256,
        win_length=1024,
        sampling_rate=16000,
        n_mel_channels=80,
        mel_fmin=0.0,
        mel_fmax=None,
        min_level_db=16,  # NB: see the discussion below -- this should be -100
    ):
        super().__init__()
        ##############################################
        # FFT Parameters                             #
        ##############################################
        window = torch.hann_window(win_length).float()
        # NB: librosa >= 0.10 requires keyword arguments here (sr=, n_fft=, ...)
        mel_basis = librosa_mel_fn(
            sampling_rate, n_fft, n_mel_channels, mel_fmin, mel_fmax
        )
        mel_basis = torch.from_numpy(mel_basis).float()
        self.register_buffer("mel_basis", mel_basis)
        self.register_buffer("window", window)
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
        self.sampling_rate = sampling_rate
        self.n_mel_channels = n_mel_channels
        self.min_level_db = min_level_db

    def forward(self, audio):
        # Reflect-pad so frames cover the signal edges, then drop the channel dim.
        p = (self.n_fft - self.hop_length) // 2
        audio = F.pad(audio, (p, p), "reflect").squeeze(1)
        fft = torch.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self.window,
            center=False,
            return_complex=True,  # required on recent PyTorch versions
        )
        magnitude = fft.abs()
        # Project linear-frequency magnitudes onto the mel basis.
        mel_output = torch.matmul(self.mel_basis, magnitude)
        # Amplitude to dB, then normalize to [0, 1] (WaveRNN-style).
        log_mel_spec = 20. * torch.log10(torch.clamp(mel_output, min=1e-5))
        log_mel_spec = torch.clamp(
            (log_mel_spec - self.min_level_db) / -self.min_level_db, 0, 1
        )
        return log_mel_spec
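A quick smoke test of the module above (the (batch, 1, samples) input layout matches the squeeze(1) in forward; the frame count is just what the padding math works out to):

import torch

a2m = Audio2Mel()
audio = torch.randn(1, 1, 16000)   # one second of random "audio" at 16 kHz
mel = a2m(audio)
print(mel.shape)                   # torch.Size([1, 80, 62])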
MorganCZY commented 4 years ago

@ye2020 How did you control the speaker embeddings? Also, could you release some wav samples?

himajin2045 commented 4 years ago

@MorganCZY Train the model on a multi-speaker corpus and it will generalize automatically. Just listen to the audio samples released by the authors; the results in the "Samples along Training" section are very close to mine.

nikawool commented 4 years ago

@ye2020 I've been training for 9 hours and it hasn't stopped. How long did you train for those samples? Did you train on a CPU or a GPU?

hdmjdp commented 4 years ago

> @MorganCZY Train the model on a multi-speaker corpus and it will generalize automatically. […]

Why min_level_db=16? I think it should be -100.

himajin2045 commented 4 years ago

> Why min_level_db=16? I think it should be -100.

Yes, it's -100.

min_level_db = -100
sample_rate = 16000
n_fft = 1024
num_mels = 80
fmin = 90
fmax = 7600
hop_length = 256
win_length = 1024
ref_level_db = 16
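Plugged into the Audio2Mel module above, those hparams would look like this (only min_level_db differs from the defaults posted earlier; fmin/fmax feed through to the mel filterbank):

a2m = Audio2Mel(
    n_fft=1024,
    hop_length=256,
    win_length=1024,
    sampling_rate=16000,
    n_mel_channels=80,
    mel_fmin=90.0,
    mel_fmax=7600.0,
    min_level_db=-100,   # the corrected value from this discussion
)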
hdmjdp commented 4 years ago

> ref_level_db = 16

Is ref_level_db actually used anywhere?
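In the WaveRNN-style dsp.py these hparams seem to come from, ref_level_db is used in the linear-spectrogram path (subtracted from the dB values before normalization) but not in the mel path, which is why the Audio2Mel above gets away without it. A sketch of that convention (treat the exact function bodies as assumptions rather than a verbatim copy of that repo):

import numpy as np

min_level_db = -100
ref_level_db = 16

def amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def normalize(S):
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)

def spectrogram(mag):
    # Linear spectrogram: shift by ref_level_db, then normalize.
    return normalize(amp_to_db(mag) - ref_level_db)

def melspectrogram(mel_mag):
    # Mel spectrogram: no ref_level_db shift.
    return normalize(amp_to_db(mel_mag))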

plutols commented 4 years ago

@ye2020 When you run the vocoder at inference time, what is its input? Is it the mel output of your Tacotron?

MayukhSobo commented 4 years ago

@ye2020 Can you please send me the complete training code for AutoVC? I am struggling to get the same output. You can email me at mayukh2012@hotmail.com

nkcdy commented 4 years ago

> I trained the model on the Chinese corpus SLR38 for 1.2 million steps, and the generated result for an unseen speaker (from the same corpus but not in the training set) sounds really good. It's worth noting that the generated audio still had some background noise at 0.9 million steps. With batch size 2 on a single RTX 2080, the training speed is 100 steps per 17 seconds. I changed the mel generation function a little to match other open source implementations (e.g. https://github.com/fatchord/WaveRNN/blob/master/utils/dsp.py). [Audio2Mel code quoted above]

With the same Chinese corpus and the same hyperparameters, the waveforms I get after training 1.2 million steps are still very noisy. I don't know why.