@begeekmyfriend How do we fix it? Just replace num_mels with num_freq? And once it is fixed, what is the improvement compared with the previous version?
In my opinion, the Tacotron part converges well, unlike the wavenet_vocoder part.
linear_loss = 0.5 * tf.reduce_mean(l1) + 0.5 * tf.reduce_mean(l1[:,:,0:n_priority_freq])
This expression means we take 0.5 weight on the whole frequency bandwidth plus the remaining 0.5 weight on the priority bandwidth as the complete linear loss used to train the model. The num_freq factor determines that bandwidth, so in my humble opinion the higher frequencies of the targets also take part in fitting the ground-truth audio. Therefore Keith Ito's version is all you need.
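For concreteness, this is roughly what the L1 version of that loss looks like when n_priority_freq is derived from num_freq (a sketch only, assuming the model exposes self.linear_targets and self.linear_outputs as in the code later in this thread):
l1 = tf.abs(self.linear_targets - self.linear_outputs)
# Emphasize the 0-4000 Hz band; num_freq (= n_fft / 2 + 1) is the number of linear bins
n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_freq)
linear_loss = 0.5 * tf.reduce_mean(l1) + 0.5 * tf.reduce_mean(l1[:,:,0:n_priority_freq])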
@begeekmyfriend Thank you for your good opinion. Did you test it? Did it reduce noise in results such as 'step-xxxx-wave-from-mels'?
It does nothing to the mel outputs. By the way, the quality of the linear outputs is typically better than that of the mel ones. Here are my hyperparameters (under testing). You can see that I use a 2048 FFT size and 1025 frequency bins with the Griffin-Lim vocoder.
#Audio
num_mels = 80, #Number of mel-spectrogram channels and local conditioning dimensionality
num_freq = 1025, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing network
rescale = True, #Whether to rescale audio prior to preprocessing
rescaling_max = 0.999, #Rescaling value
trim_silence = True, #Whether to clip silence in Audio (at beginning and end of audio only, not the middle)
clip_mels_length = True, #For cases of OOM (Not really recommended, working on a workaround)
max_mel_frames = 960, #Only relevant when clip_mels_length = True
# Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
# It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
# Does not work if n_fft is not a multiple of hop_size!!
use_lws=False,
silence_threshold=2, #silence threshold used for sound trimming for wavenet preprocessing
#Mel spectrogram
n_fft = 2048, #Extra window size is filled with 0 paddings to match this parameter
hop_size = None, #For 22050Hz, 275 ~= 12.5 ms
win_size = 1100, #For 22050Hz, 1100 ~= 50 ms (If None, win_size = n_fft)
sample_rate = 22050, #22050 Hz (corresponding to ljspeech dataset)
frame_shift_ms = 12.5,
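As a side note, here is a small sketch of how those numbers hang together (the get_hop_size helper name is just for illustration, not necessarily the repo's exact code):
# num_freq is tied to the FFT size: 2048 // 2 + 1 = 1025 linear bins
assert hparams.num_freq == hparams.n_fft // 2 + 1

def get_hop_size(hparams):
    # When hop_size is None, derive it from frame_shift_ms: 0.0125 * 22050 ~= 275 samples
    if hparams.hop_size is None:
        return int(hparams.frame_shift_ms / 1000 * hparams.sample_rate)
    return hparams.hop_size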
@begeekmyfriend does it affect this part? Thanks a lot.
It definitely does, because I have expanded both the FFT size and the number of frequency bins of the linear outputs, so the audio signal processing is affected. That is to say, you have to pre-process the whole audio dataset and train again from scratch. By the way, these hyperparameters do not match the wavenet vocoder; they are only for Griffin-Lim.
We can also use an L2 loss for the linear outputs if you think L2 is better than L1:
n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_freq)
linear_loss = 0.5 * tf.losses.mean_squared_error(self.linear_targets, self.linear_outputs) \
+ 0.5 * tf.losses.mean_squared_error(self.linear_targets[:,:,0:n_priority_freq], self.linear_outputs[:,:,0:n_priority_freq])
The reason why we prefer L2 is mentioned here: https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-375214086
These are my test results.
[image: num_mels, 10000 iterations]
[image: num_freq, 10000 iterations]
I am afraid there might be problems with your dataset. In my test it reached convergence within 4K steps when I adopted the solutions mentioned in both the 5th and 8th comments, i.e. using MSE for the linear loss. Below is one of the results from Griffin-Lim at 15K steps: step-15000-eval-waveform-linear.zip
@begeekmyfriend Why does your run already look so well converged at 4000 steps with loss = 0.59, while mine at 70000 steps with loss = 0.37 still does not show such a clean curve? ...
@begeekmyfriend my good friend, you are correct once more!
I have fixed that. I apologize for the typo :) Thanks for your feedback!
In the expression computing the linear loss, num_mels should have been num_freq. See Keith Ito's version. It seems that this model does not compute the loss over the full effective bandwidth of the audio.
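In other words, the one-line fix looks roughly like this (a sketch; the surrounding code is the linear loss shown earlier in this thread):
# Before (wrong): the priority band is scaled by the number of mel channels
# n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_mels)
# After (correct): scale by the number of linear frequency bins
n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_freq)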