NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Cannot change speaker for interpolation #35

Open DamienToomey opened 4 years ago

DamienToomey commented 4 years ago

Hello,

I am trying to interpolate between two speakers. I am using the model pretrained on LibriTTS.

I have read the issue "How is interpolation between speakers performed?" #33, but I still cannot get it to work.

Here are the steps I have followed: I sample two latent tensors z_1 and z_2 from a Gaussian prior, generate a spectrogram from each with the pretrained Flowtron, and then generate audio with the pretrained WaveGlow (a sketch of the code is below).

But even when sampling z_1 and z_2 multiple times, the speaker sounds the same in both outputs; only the audio quality seems to vary. (z_1 and z_2 have different values.)
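For reference, a minimal sketch of what I am running. It assumes model, speaker_vecs, and text are set up as in this repo's inference.py; the exact shapes and calls are my best reading of that script, not verbatim code:

```python
import torch

sigma, n_frames = 0.5, 400  # illustrative values

# Sample two latent sequences from the Flowtron prior;
# (1, 80, n_frames) follows the residual shape in inference.py.
z_1 = torch.randn(1, 80, n_frames).cuda() * sigma
z_2 = torch.randn(1, 80, n_frames).cuda() * sigma

with torch.no_grad():
    # Decode each latent with the pretrained Flowtron ...
    mel_1, _ = model.infer(z_1, speaker_vecs, text)
    mel_2, _ = model.infer(z_2, speaker_vecs, text)
    # ... and vocode with the pretrained WaveGlow.
    audio_1 = waveglow.infer(mel_1, sigma=0.8)
    audio_2 = waveglow.infer(mel_2, sigma=0.8)
```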

Thanks

rafaelvalle commented 4 years ago

You need to make sure z_1 and z_2 produce samples from different speakers. Sample z_1 once, perform inference, and memorize the speaker's voice. Keep sampling z_2, performing inference, and listening to the samples produced with z_2 until the speaker you hear is different from the one produced with z_1. Once you have z_1 and z_2 values associated with different speakers, you can interpolate between them. It is safer to set gate_threshold = 1 and prune the audio later.
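A rough sketch of this loop (not verbatim from the repo; the infer and vocoder calls mirror inference.py, and the alpha sweep is only illustrative):

```python
import torch

sigma, n_frames = 0.5, 400  # illustrative values

# Fix z_1 once and memorize the voice it produces.
z_1 = torch.randn(1, 80, n_frames).cuda() * sigma

# Re-sample this line (and listen to the result) until its speaker
# clearly differs from the one produced by z_1.
z_2 = torch.randn(1, 80, n_frames).cuda() * sigma

# Once the two latents map to different speakers, interpolate.
# model, speaker_vecs, and text are assumed set up as in inference.py.
for alpha in torch.linspace(0, 1, steps=5):
    z = alpha * z_1 + (1 - alpha) * z_2
    with torch.no_grad():
        mel, _ = model.infer(z, speaker_vecs, text)
        audio = waveglow.infer(mel, sigma=0.8)
```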

DamienToomey commented 4 years ago

I have also set model_config['dummy_speaker_embedding'] = True.

I keep sampling z_2, performing inference, and listening to the samples produced with z_2, but the speaker's voice sounds the same as the voice produced with z_1. By the way, it is always a female voice. Do you have any idea why this might be happening?

rafaelvalle commented 4 years ago

Are you using the LibriTTS model?

DamienToomey commented 4 years ago

Yes, I am using the LibriTTS model.

rafaelvalle commented 3 years ago

Hey Damien, the pre-trained LibriTTS model available in our repo has speaker embeddings.

You need to train a model without speaker embeddings, i.e. with model_config['dummy_speaker_embedding'] = True, to be able to interpolate between speakers in the latent space.

You can warm-start from the pre-trained LibriTTS model with speaker embeddings.
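A hedged sketch of that setup (the top-level keys follow this repo's config.json layout and this thread; the checkpoint path is a placeholder):

```python
import json

# Load the repo's config.json.
with open('config.json') as f:
    config = json.load(f)

# Train without learned speaker embeddings so latent-space
# interpolation can move between speakers.
config['model_config']['dummy_speaker_embedding'] = True

# Warm-start from the pretrained LibriTTS checkpoint (placeholder path).
config['train_config']['checkpoint_path'] = 'models/flowtron_libritts.pt'

with open('config_dummy_speaker.json', 'w') as f:
    json.dump(config, f, indent=4)
```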