NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

confused about source speaker id in style and rhythm transfer #18

Open JeffpanUK opened 4 years ago

JeffpanUK commented 4 years ago

Hi, I'm a little confused about the speaker ids in the reference audio and text. When doing style and rhythm transfer, the reference speaker ids get re-mapped to 0, 1, 2, ... (see data_utils.py and the inference script):

file_idx = 0
print(dataloader.audiopaths_and_text)
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))

In that case, consider the same entry, e.g. "audio_10|text_10|10", appearing in two different filelists:

A.txt
audio_10|text_10|10
audio_0|text_0|0
B.txt
audio_10|text_10|10
audio_20|text_20|20

The reference speaker id (10) will be mapped to mellotron_id=1 and mellotron_id=0 respectively, which is bound to make the attention map (a.k.a. the rhythm in Mellotron) different.
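
For reference, this remapping comes from TextMelLoader building a speaker lookup table over the unique speaker ids of whichever filelist it is handed. A minimal sketch of that behavior (modeled on create_speaker_lookup_table in data_utils.py; the exact code in the repo may differ):

import numpy as np

def speaker_lookup_table(audiopaths_and_text):
    # unique raw speaker ids, sorted, then mapped to 0, 1, 2, ...
    ids = np.sort(np.unique([int(x[2]) for x in audiopaths_and_text]))
    return {int(ids[i]): i for i in range(len(ids))}

# A.txt: speakers {0, 10}  -> raw id 10 becomes mellotron_id 1
print(speaker_lookup_table([('audio_10', 'text_10', '10'), ('audio_0', 'text_0', '0')]))
# B.txt: speakers {10, 20} -> raw id 10 becomes mellotron_id 0
print(speaker_lookup_table([('audio_10', 'text_10', '10'), ('audio_20', 'text_20', '20')]))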

with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

Is this expected, or have I misunderstood something?

rafaelvalle commented 4 years ago

What you mentioned could have happened during training, for example when the training and validation filelists have different numbers of speakers. We circumvent this by first building a Mellotron speaker-id dictionary from the training data and reusing it for the validation data. https://github.com/NVIDIA/mellotron/blob/master/train.py#L44
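
For context, a minimal sketch of what that looks like (based on prepare_dataloaders in train.py, assuming TextMelLoader accepts a speaker_ids argument and exposes a speaker_ids attribute):

# build the speaker-id dictionary once from the training filelist, then
# reuse it so validation shares the same raw-id -> mellotron-id mapping
trainset = TextMelLoader(hparams.training_files, hparams)
valset = TextMelLoader(hparams.validation_files, hparams,
                       speaker_ids=trainset.speaker_ids)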

JeffpanUK commented 4 years ago

> What you mentioned could have happened during training, for example when the training and validation filelists have different numbers of speakers. We circumvent this by first building a Mellotron speaker-id dictionary from the training data and reusing it for the validation data. https://github.com/NVIDIA/mellotron/blob/master/train.py#L44

Thanks for your reply. I've noticed that part of the code in training. However, my concern is the inference stage: when we try to get the rhythm from some reference audio, we need to load the reference filelist with TextMelLoader, and I found that no speaker-id dictionary is given to the TextMelLoader:

arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
audio_paths = 'data/examples_filelist.txt'
dataloader = TextMelLoader(audio_paths, hparams)
datacollate = TextMelCollate(1)

file_idx = 0
print(dataloader.audiopaths_and_text)
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))

In this case, the Mellotron speaker ids depend on the set of speakers in the reference filelist. We then call mellotron.forward to get the reference rhythm, as below:

with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

where x contains ref_text, ref_mel, ref_f0 and ref_mellotron_speaker_ids, so for the same reference audio the generated rhythm will change if the set of speakers in the reference filelist changes.
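
One way to keep the inference-time mapping consistent with training would be to hand the reference loader the speaker-id dictionary built from the training filelist; a rough sketch, assuming TextMelLoader takes the same speaker_ids argument used in train.py:

# reuse the training speaker-id mapping for the reference filelist so the
# same raw speaker id always maps to the same mellotron id at inference
train_loader = TextMelLoader(hparams.training_files, hparams)
dataloader = TextMelLoader(audio_paths, hparams,
                           speaker_ids=train_loader.speaker_ids)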

rafaelvalle commented 4 years ago

During our experiments, we noticed that the rhythm (alignment map) we get from Tacotron seems to be independent of the speaker id provided. You can try, for example, providing different speaker ids while using Tacotron as a forced aligner and checking whether there is a significant difference.
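
One concrete way to run that check (a sketch only; x_a and x_b are hypothetical inputs built exactly as in the snippets above, once from A.txt and once from B.txt, so only the mapped speaker id differs for the same reference audio):

with torch.no_grad():
    # rhythm (alignment map) for the same audio under two different speaker ids
    _, _, _, rhythm_a = mellotron.forward(x_a)
    _, _, _, rhythm_b = mellotron.forward(x_b)

# same audio and text, so the alignment maps have identical shapes
print('max abs difference between alignment maps:',
      (rhythm_a - rhythm_b).abs().max().item())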

rafaelvalle commented 4 years ago

Closing due to inactivity.