Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Here is a bug in the linear loss computation #113

Closed begeekmyfriend closed 6 years ago

begeekmyfriend commented 6 years ago

In the expression computing the linear loss, num_mels should have been num_freq. See Keith Ito's version. It seems that this model does not compute the loss over the effective bandwidth of the audio.
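For reference, a minimal sketch of the buggy line and the fix, assuming the loss code mirrors Keith Ito's version (the 4000 Hz priority cutoff and variable names follow the snippets later in this thread):

    l1 = tf.abs(self.linear_targets - self.linear_outputs)
    # Buggy: the priority band is scaled by the mel channel count (80 bins)
    # n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_mels)
    # Fix: scale by the number of linear frequency bins instead
    n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_freq)
    linear_loss = 0.5 * tf.reduce_mean(l1) + 0.5 * tf.reduce_mean(l1[:, :, 0:n_priority_freq])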

Yeongtae commented 6 years ago

@begeekmyfriend how do we fix it? Just replace num_mels with num_freq? And when we fix it, what is the improvement compared with the previous one?

In my opinion, the Tacotron part converges well, unlike the wavenet_vocoder part.

begeekmyfriend commented 6 years ago

    linear_loss = 0.5 * tf.reduce_mean(l1) + 0.5 * tf.reduce_mean(l1[:,:,0:n_priority_freq])

This expression means we use a 0.5 weight on the whole frequency bandwidth plus a 0.5 weight on the priority band as the complete linear loss used to train the model. The num_freq factor determines how wide that bandwidth is. So in my humble opinion, with the fix the higher frequencies of the targets also take part in fitting the ground truth audio. Therefore Keith Ito's version is all you need.
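To make that concrete, here is the arithmetic with the hyper parameters quoted below (sample_rate = 22050, num_freq = 1025, num_mels = 80; the 4000 Hz cutoff is the one used later in this thread):

    int(4000 / (22050 * 0.5) * 1025)  # = 371 bins: the priority band reaches ~4 kHz as intended
    int(4000 / (22050 * 0.5) * 80)    # = 29 bins with the buggy num_mels: only ~312 Hz of the 1025-bin targets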

Yeongtae commented 6 years ago

@begeekmyfriend Thank you for your good opinion. Did you test it? Did it reduce noise in the results such as 'step-xxxx-wave-from-mels'?

begeekmyfriend commented 6 years ago

It has nothing to do with the mel outputs. By the way, the linear outputs typically sound better than the mel ones. Here are my hyper parameters (under testing). You can see that I use a 2048 FFT size and 1025 frequency bins with the Griffin-Lim vocoder.

    #Audio
    num_mels = 80, #Number of mel-spectrogram channels and local conditioning dimensionality
    num_freq = 1025, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing network
    rescale = True, #Whether to rescale audio prior to preprocessing
    rescaling_max = 0.999, #Rescaling value
    trim_silence = True, #Whether to clip silence in Audio (at beginning and end of audio only, not the middle)
    clip_mels_length = True, #For cases of OOM (Not really recommended, working on a workaround)
    max_mel_frames = 960,  #Only relevant when clip_mels_length = True

    # Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
    # It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
    # Does not work if n_fft is not a multiple of hop_size!!
    use_lws=False,
    silence_threshold=2, #silence threshold used for sound trimming for wavenet preprocessing

    #Mel spectrogram
    n_fft = 2048, #Extra window size is filled with 0 paddings to match this parameter
    hop_size = None, #For 22050Hz, 275 ~= 12.5 ms
    win_size = 1100, #For 22050Hz, 1100 ~= 50 ms (If None, win_size = n_fft)
    sample_rate = 22050, #22050 Hz (corresponding to ljspeech dataset)
    frame_shift_ms = 12.5,
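
A side note on hop_size = None above: the value is derived from frame_shift_ms when the model loads. A minimal sketch of that fallback (the exact helper in the repo's audio code may differ):

    def get_hop_size(hparams):
        hop_size = hparams.hop_size
        if hop_size is None:
            assert hparams.frame_shift_ms is not None
            # 12.5 ms at 22050 Hz -> int(0.0125 * 22050) = 275 samples
            hop_size = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate)
        return hop_size
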
Yeongtae commented 6 years ago

[screenshot] @begeekmyfriend does it affect this part? Thanks a lot.

begeekmyfriend commented 6 years ago

It definitely does, because I have expanded both the FFT size and the number of frequency bins of the linear outputs, so the audio signal processing is affected. That is to say, you have to pre-process the whole audio dataset again and train from scratch. By the way, these hyper parameters do not match the WaveNet vocoder; they are only for Griffin-Lim.

begeekmyfriend commented 6 years ago

We can also use an L2 loss for the linear outputs if you think L2 is better than L1:

    n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_freq)
    linear_loss = 0.5 * tf.losses.mean_squared_error(self.linear_targets, self.linear_outputs) \
            + 0.5 * tf.losses.mean_squared_error(self.linear_targets[:,:,0:n_priority_freq], self.linear_outputs[:,:,0:n_priority_freq])

The reason why we prefer L2 is mentioned here https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-375214086
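
For comparison, the same two-term loss in its L1 form with the tf.losses API (a sketch; tf.losses.absolute_difference is the TF1 L1 counterpart of mean_squared_error):

    linear_loss = 0.5 * tf.losses.absolute_difference(self.linear_targets, self.linear_outputs) \
            + 0.5 * tf.losses.absolute_difference(self.linear_targets[:,:,0:n_priority_freq],
                                                  self.linear_outputs[:,:,0:n_priority_freq])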

Yeongtae commented 6 years ago

These are my test results.

--num_mels, 10000 iterations-- [step-10000-eval-mel-spectrogram] [step-10000-eval-align]

--num_freq, 10000 iterations-- [step-10000-eval-mel-spectrogram] [step-10000-eval-align]

begeekmyfriend commented 6 years ago

I am afraid there might be problems in your dataset. In my test it achieved convergence within 4K steps when I adopted the solutions mentioned in the 5th and 8th comments above, i.e. using MSE for the linear loss. [step-4000-align] And below is one of the results from Griffin-Lim at 15K steps: step-15000-eval-waveform-linear.zip

Starlon87 commented 6 years ago

@begeekmyfriend Why does your run already look so converged at 4000 steps with loss = 0.59, while here at 70000 steps with loss = 0.37 I still don't get a slim alignment curve?...

[step-70000-eval-align]

Rayhane-mamah commented 6 years ago

@begeekmyfriend my good friend, you are correct once more!

I have fixed that. I apologize for the typo :) Thanks for your feedback!