NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Problem with the Mel Spectrogram Representation #41

Closed yliess86 closed 6 years ago

yliess86 commented 6 years ago

Request from issue #24: After training the model (70,000 iterations, val loss: 0.46, 0.34 <= loss <= 0.49) and converting the obtained mel spectrogram so it can be fed to r9y9's WaveNet vocoder, the resulting audio sounds like the voice has the flu.

mel = audio._amp_to_db(mel) - hparams.ref_level_db
if not hparams.allow_clipping_in_normalization:
    assert mel.max() <= 0 and mel.min() - hparams.min_level_db >= 0
mel = audio._normalize(mel)



![Mel Spec](https://camo.githubusercontent.com/5ba4ce9a331453ae8c078d32653b36af12f9458a/68747470733a2f2f696d6167652e6962622e636f2f6665754a564a2f6d656c5f737065635f302e706e67)
rafaelvalle commented 6 years ago

The mel-spectrogram representation output by the Tacotron 2 model you trained does not match the mel-spectrogram representation used in r9y9's MoL WaveNet. More specifically, the minimum and maximum mel frequencies are different.

The code below converts a mel trained with the default mel-spectrogram representation in this repo to the representation used in r9y9's shared WaveNet MoL. Ideally one would train Tacotron 2 and WaveNet with the same mel representation, especially the minimum and maximum mel frequencies.

import numpy as np
import torch

from layers import TacotronSTFT

# load mel file output by Tacotron 2
mel = torch.autograd.Variable(torch.from_numpy(
    np.load('mel_spec.npy'))[None, :])

# Tacotron 2 Training Params
filter_length = 1024
hop_length = 256
win_length = 1024
sampling_rate = 22050
mel_fmin = 0.0
mel_fmax = None
taco_stft = TacotronSTFT(
    filter_length, hop_length, win_length, 
    sampling_rate=sampling_rate, mel_fmin=mel_fmin, 
    mel_fmax=mel_fmax)

# Project from Mel-Spectrogram to Spectrogram
mel_decompress = taco_stft.spectral_de_normalize(mel)
mel_decompress = mel_decompress.transpose(1, 2).data.cpu()
spec_from_mel_scaling = 1000
spec_from_mel = torch.mm(mel_decompress[0], taco_stft.mel_basis)
spec_from_mel = spec_from_mel.transpose(0, 1)
spec_from_mel = spec_from_mel * spec_from_mel_scaling

# r9y9 WaveNet Training Params
filter_length = 1024
hop_length = 256
win_length = 1024
sampling_rate = 22050
mel_fmin = 125
mel_fmax = 7600

taco_stft_other = TacotronSTFT(
    filter_length, hop_length, win_length, 
    sampling_rate=sampling_rate, mel_fmin=mel_fmin, mel_fmax=mel_fmax)

# Project from Spectrogram to r9y9's WaveNet Mel-Spectrogram
mel_minmax = taco_stft_other.spectral_normalize(
    torch.matmul(taco_stft_other.mel_basis, spec_from_mel))
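
If you want to check the intermediate linear spectrogram by ear before running a vocoder, a quick Griffin-Lim reconstruction is enough (a sketch; griffin_lim is defined in this repo's audio_processing.py, and the iteration count is arbitrary):

# Optional sanity check: invert the reconstructed linear spectrogram with
# Griffin-Lim and listen to it before moving on to the WaveNet vocoder.
from audio_processing import griffin_lim

griffin_iters = 60
audio = griffin_lim(
    torch.autograd.Variable(spec_from_mel.unsqueeze(0)),
    taco_stft.stft_fn, griffin_iters)
audio = audio.squeeze().data.cpu().numpy()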
rafaelvalle commented 6 years ago

The first few frames of the mel-spectrogram you provided sound like this: yliess86_audio_trim.wav.zip

MXGray commented 6 years ago

@rafaelvalle Hope you can help me figure out whether the default Tacotron 2 hparams of this repo match the nv-wavenet hparams I used below. If not, how can I make sure they match?

config.json of nv-wavenet/pytorch:

"data_config": {
    "training_files": "train_files.txt",
    "segment_length": 22050,
    "mu_quantization": 256,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 22050
},

"dist_config": {
    "dist_backend": "nccl",
    "dist_url": "tcp://localhost:54321"
},

"wavenet_config": {
    "n_in_channels": 256,
    "n_layers": 16,
    "max_dilation": 128,
    "n_residual_channels": 64,
    "n_skip_channels": 256,
    "n_out_channels": 256,
    "n_cond_channels": 80,
    "upsamp_window": 1024,
    "upsamp_stride": 256
}

}

And these are my Tacotron 2 hparams:

    # Data Parameters             #
    load_mel_from_disk=False,
    training_files='filelists/ljs_audio_text_train_filelist.txt',
    validation_files='filelists/ljs_audio_text_val_filelist.txt',
    text_cleaners=['english_cleaners'],
    sort_by_length=False,

    # Audio Parameters             #
    max_wav_value=32768.0,
    sampling_rate=22050,
    filter_length=1024,
    hop_length=256,
    win_length=1024,
    n_mel_channels=80,
    mel_fmin=0.0,
    mel_fmax=None,  # if None, half the sampling rate

    # Model Parameters             #
    n_symbols=len(symbols),
    symbols_embedding_dim=512,

    # Encoder parameters
    encoder_kernel_size=5,
    encoder_n_convolutions=3,
    encoder_embedding_dim=512,

    # Decoder parameters
    n_frames_per_step=1,  # currently only 1 is supported
    decoder_rnn_dim=1024,
    prenet_dim=256,
    max_decoder_steps=1000,
    gate_threshold=0.6,

    # Attention parameters
    attention_rnn_dim=1024,
    attention_dim=128,

    # Location Layer parameters
    attention_location_n_filters=32,
    attention_location_kernel_size=31,

    # Mel-post processing network parameters
    postnet_embedding_dim=512,
    postnet_kernel_size=5,
    postnet_n_convolutions=5,

    # Optimization Hyperparameters #
    use_saved_learning_rate=False,
    learning_rate=1e-3,
    weight_decay=1e-6,
    grad_clip_thresh=1,
    batch_size=12,
    mask_padding=False  # set model's padded outputs to padded values
)

Would greatly appreciate your help. Thanks!

rafaelvalle commented 6 years ago

Yeah, it matches.
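
For reference, the values that need to agree between the two configs can be spot-checked like this (a sketch; pairing nv-wavenet's upsamp_stride / upsamp_window / n_cond_channels with Tacotron 2's hop_length / win_length / n_mel_channels is an assumption about how the conditioning mels are upsampled):

# Sketch: the audio / conditioning parameters that should line up between
# the Tacotron 2 hparams and the nv-wavenet config posted above.
taco = dict(sampling_rate=22050, filter_length=1024, hop_length=256,
            win_length=1024, n_mel_channels=80)
nvwav = dict(sampling_rate=22050, filter_length=1024, hop_length=256,
             win_length=1024, n_cond_channels=80,
             upsamp_window=1024, upsamp_stride=256)

assert taco['sampling_rate'] == nvwav['sampling_rate']
assert taco['filter_length'] == nvwav['filter_length']
assert taco['win_length'] == nvwav['win_length'] == nvwav['upsamp_window']
assert taco['hop_length'] == nvwav['hop_length'] == nvwav['upsamp_stride']
assert taco['n_mel_channels'] == nvwav['n_cond_channels']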

yliess86 commented 6 years ago

@rafaelvalle It is working now. Thank you for the help! When I finish my prototype, I will probably retrain both models with the same mel representation.

rafaelvalle commented 6 years ago

Closing. Please re-open if new issues appear!