The representation of the mel-spectrograms output by the Tacotron 2 model you trained does not match the mel-spectrogram representation used in r9y9's MoL WaveNet. More specifically, the minimum and maximum mel frequencies are different.
The code below converts a mel trained with the default mel-spectrogram representation in this repo to the representation used in r9y9's shared WaveNet MoL. Ideally one would train Tacotron 2 and WaveNet with the same mel representation, especially the same minimum and maximum mel frequencies.
import numpy as np
import torch

from layers import TacotronSTFT  # from this repo

# load mel file output by Tacotron 2
mel = torch.autograd.Variable(torch.from_numpy(
    np.load('mel_spec.npy'))[None, :])
# Tacotron 2 Training Params
filter_length = 1024
hop_length = 256
win_length = 1024
sampling_rate = 22050
mel_fmin = 0.0
mel_fmax = None
taco_stft = TacotronSTFT(
    filter_length, hop_length, win_length,
    sampling_rate=sampling_rate, mel_fmin=mel_fmin,
    mel_fmax=mel_fmax)
# Project from Mel-Spectrogram to Spectrogram
mel_decompress = taco_stft.spectral_de_normalize(mel)
mel_decompress = mel_decompress.transpose(1, 2).data.cpu()
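# empirical scale factor applied to the recovered linear spectrogram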
spec_from_mel_scaling = 1000
spec_from_mel = torch.mm(mel_decompress[0], taco_stft.mel_basis)
spec_from_mel = spec_from_mel.transpose(0, 1)
spec_from_mel = spec_from_mel * spec_from_mel_scaling
# WaveNet Decoder 2 Training Params
filter_length = 1024
hop_length = 256
win_length = 1024
sampling_rate = 22050
mel_fmin = 125
mel_fmax = 7600
taco_stft_other = TacotronSTFT(
    filter_length, hop_length, win_length,
    sampling_rate=sampling_rate, mel_fmin=mel_fmin, mel_fmax=mel_fmax)
# Project from Spectrogram to r9y9's WaveNet Mel-Spectrogram
mel_minmax = taco_stft_other.spectral_normalize(
    torch.matmul(taco_stft_other.mel_basis, spec_from_mel))
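The converted tensor can then be written back to disk for the vocoder (a minimal sketch; the output file name is hypothetical):

# save the re-projected mel for r9y9's WaveNet
np.save('mel_spec_r9y9.npy', mel_minmax.data.cpu().numpy())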
The first few frames of the mel-spectrogram you provided sound like this: yliess86_audio_trim.wav.zip
@rafaelvalle I hope you can help me figure out whether the default Tacotron 2 hparams of this repo match the nv-wavenet hparams I used below. If not, how can I ensure that they match?
config.json of nv-wavenet/pytorch:
"data_config": {
"training_files": "train_files.txt",
"segment_length": 22050,
"mu_quantization": 256,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"sampling_rate": 22050
},
"dist_config": {
"dist_backend": "nccl",
"dist_url": "tcp://localhost:54321"
},
"wavenet_config": {
"n_in_channels": 256,
"n_layers": 16,
"max_dilation": 128,
"n_residual_channels": 64,
"n_skip_channels": 256,
"n_out_channels": 256,
"n_cond_channels": 80,
"upsamp_window": 1024,
"upsamp_stride": 256
}
}
And these are my Tacotron 2 hparams:
# Data Parameters #
load_mel_from_disk=False,
training_files='filelists/ljs_audio_text_train_filelist.txt',
validation_files='filelists/ljs_audio_text_val_filelist.txt',
text_cleaners=['english_cleaners'],
sort_by_length=False,
# Audio Parameters #
max_wav_value=32768.0,
sampling_rate=22050,
filter_length=1024,
hop_length=256,
win_length=1024,
n_mel_channels=80,
mel_fmin=0.0,
mel_fmax=None, # if None, half the sampling rate
# Model Parameters #
n_symbols=len(symbols),
symbols_embedding_dim=512,
# Encoder parameters
encoder_kernel_size=5,
encoder_n_convolutions=3,
encoder_embedding_dim=512,
# Decoder parameters
n_frames_per_step=1, # currently only 1 is supported
decoder_rnn_dim=1024,
prenet_dim=256,
max_decoder_steps=1000,
gate_threshold=0.6,
# Attention parameters
attention_rnn_dim=1024,
attention_dim=128,
# Location Layer parameters
attention_location_n_filters=32,
attention_location_kernel_size=31,
# Mel-post processing network parameters
postnet_embedding_dim=512,
postnet_kernel_size=5,
postnet_n_convolutions=5,
# Optimization Hyperparameters #
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1,
batch_size=12,
mask_padding=False # set model's padded outputs to padded values
)
Would greatly appreciate your help. Thanks!
Yeah, it matches.
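The fields that have to agree are the STFT/audio parameters, and they are identical in both configs; note also that upsamp_stride equals hop_length and upsamp_window equals win_length, so the conditioning mel frames are upsampled back to the audio rate correctly. A minimal sanity check, with the values copied from the snippets above:

# Tacotron 2 audio params vs. nv-wavenet data_config (values from above)
taco2_audio = dict(filter_length=1024, hop_length=256,
                   win_length=1024, sampling_rate=22050)
nv_wavenet_data = dict(filter_length=1024, hop_length=256,
                       win_length=1024, sampling_rate=22050)
for key in ("filter_length", "hop_length", "win_length", "sampling_rate"):
    assert taco2_audio[key] == nv_wavenet_data[key], key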
@rafaelvalle It is now working. Thank you for the help! When I finish my prototype, I will probably retrain both models with the same mel representation.
Closing. Please re-open if new issues appear!
Request from issue #24: After training the model (70,000 iterations, val loss 0.46, with 0.34 ≤ loss ≤ 0.49) and converting the resulting mel spectrogram to be fed to r9y9's WaveNet vocoder, the output sounds like the voice has the flu.
'This is an example of text to speech synthesis after 9 days training. This may sound awful, but it is a start.'
For context, this is the normalization r9y9's preprocessing applies to the mel spectrogram:

mel = audio._amp_to_db(mel) - hparams.ref_level_db
if not hparams.allow_clipping_in_normalization:
    assert mel.max() <= 0 and mel.min() - hparams.min_level_db >= 0
mel = audio._normalize(mel)
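The amplitude scales differ as well: this repo stores the natural log of the mel magnitudes, while the snippet above maps them to dB and normalizes to [0, 1]. A minimal conversion sketch, assuming r9y9's default min_level_db = -100 and ref_level_db = 20 (check your hparams; the function name is hypothetical):

import numpy as np

MIN_LEVEL_DB = -100  # assumed r9y9 default
REF_LEVEL_DB = 20    # assumed r9y9 default

def taco2_mel_to_r9y9(mel_ln):
    # Tacotron 2 stores ln(mel magnitude); undo the log compression
    amp = np.exp(mel_ln)
    # equivalent of _amp_to_db, minus the reference level
    db = 20.0 * np.log10(np.maximum(1e-5, amp)) - REF_LEVEL_DB
    # equivalent of _normalize: map [min_level_db, 0] dB onto [0, 1]
    return np.clip((db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0.0, 1.0)

Note this only addresses the amplitude scale; the mel_fmin/mel_fmax mismatch still requires the re-projection shown at the top of the thread.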