HGU-DLLAB / Korean-FastSpeech2-Pytorch

Implementation of Korean FastSpeech2

Calculating validation loss during training #7

Closed · suwhoanlim closed this issue 3 years ago

suwhoanlim commented 3 years ago

Hello, @Jackson-Kang,

While I was reviewing the code, I realized that in evaluation.py, when the model is evaluated and the loss is calculated, the target data is used rather than the model's predictions.

The pieces of code that I found suspicious are as follows:

Lines 72-79 in evaluation.py:

with torch.no_grad():
    # Forward
    (mel_output, mel_postnet_output, log_duration_output, f0_output,
     energy_output, src_mask, mel_mask, out_mel_len) = model(
        text, src_len, mel_len, D, f0, energy, max_src_len, max_mel_len)

    # Calculate loss
    mel_loss, mel_postnet_loss, d_loss, f_loss, e_loss = Loss(
        log_duration_output, log_D, f0_output, f0, energy_output, energy,
        mel_output, mel_postnet_output, mel_target, ~src_mask, ~mel_mask)

Lines 38-49 in modules.py:

pitch_prediction = self.pitch_predictor(x, src_mask)
if pitch_target is not None:
    pitch_embedding = self.pitch_embedding_producer(pitch_target.unsqueeze(2))
else:
    pitch_embedding = self.pitch_embedding_producer(pitch_prediction.unsqueeze(2))

energy_prediction = self.energy_predictor(x, src_mask)
if energy_target is not None:
    energy_embedding = self.energy_embedding_producer(energy_target.unsqueeze(2))
else:
    energy_embedding = self.energy_embedding_producer(energy_prediction.unsqueeze(2))

Because the pitch and energy targets are passed as arguments to model() in evaluation.py, those targets are the ones used in modules.py, whereas I believe the predicted values should be used instead.
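
For illustration, here is a minimal sketch (hypothetical, reusing the names from the snippets above) of what I would expect validation to look like, with None passed for the pitch/energy targets so that the else branches in modules.py fall back to the predictions:

with torch.no_grad():
    # Hypothetical sketch: keep the duration target D so the length regulator
    # still aligns the output with mel_target, but pass None for f0/energy so
    # the variance adaptor uses its own predictions.
    (mel_output, mel_postnet_output, log_duration_output, f0_output,
     energy_output, src_mask, mel_mask, out_mel_len) = model(
        text, src_len, mel_len, D, None, None, max_src_len, max_mel_len)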

Is there any reason why we are using the targets for validation? Or is there perhaps something I missed?

Any comments would be appreciated. Thanks!

Jackson-Kang commented 3 years ago

Hello, @suwhoanlim.

I think the reason is to provide accurate validation results for researchers. If the model's input differs between the training and validation steps, then we cannot know the exact validation result. Also, because FastSpeech2 is a multi-task learner, it produces poor-quality mel-spectrograms when the predicted prosodies (pitch/energy) are used, and then we cannot tell exactly which module produces the poor results. For example, suppose the pitch/energy predictor produces poor estimates and we use them: we cannot know whether the pitch predictor produced the poor qualities or the decoder produced the wrong results.

But from a different point of view, you can edit this repository on your own for different goals. This may be arguable and an open question, so I am posting my opinion. A hypothetical wrapper that supports both behaviours is sketched below.
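
As a sketch (this helper is not in the repository; the names follow evaluation.py and modules.py above), teacher forcing during validation could be made a switch:

def evaluate_batch(model, Loss, batch, teacher_force=True):
    # Hypothetical helper: validate either with ground-truth prosodies
    # (teacher forcing, as evaluation.py does now) or with the model's
    # own pitch/energy predictions.
    (text, src_len, mel_len, D, log_D, f0, energy,
     mel_target, max_src_len, max_mel_len) = batch
    with torch.no_grad():
        f0_in = f0 if teacher_force else None
        energy_in = energy if teacher_force else None
        (mel_output, mel_postnet_output, log_duration_output, f0_output,
         energy_output, src_mask, mel_mask, out_mel_len) = model(
            text, src_len, mel_len, D, f0_in, energy_in,
            max_src_len, max_mel_len)
        return Loss(log_duration_output, log_D, f0_output, f0,
                    energy_output, energy, mel_output, mel_postnet_output,
                    mel_target, ~src_mask, ~mel_mask)

Logging the losses from both settings side by side would show how much the pitch/energy predictors alone degrade the mel loss, which addresses the diagnosis concern above.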

Jackson-Kang commented 3 years ago

Closing this issue due to inactivity.