NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Cannot change speaker for interpolation #35

Open DamienToomey opened 4 years ago

DamienToomey commented 4 years ago

Hello,

I am trying to interpolate between two speakers. I am using the model pretrained on LibriTTS.

I have read issue #33, "How is interpolation between speakers performed?", but I still cannot get it to work.

Here are the steps I have followed:

But when I sample z_1 and z_2, even multiple times, generate the spectrogram with the pretrained Flowtron, and generate the audio with the pretrained WaveGlow, the speaker sounds the same; only the audio quality seems to vary (z_1 and z_2 do have different values).
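Concretely, this is roughly what I am running (a sketch, not my exact script: `model`, `waveglow`, `text`, and `speaker_vecs` are assumed to be loaded as in this repo's inference.py, and `sigma = 0.5` / `n_frames = 400` are assumed values):

```python
import torch

sigma, n_frames = 0.5, 400  # assumed sampling temperature and length

# two residuals drawn from the Gaussian prior, as in inference.py
z_1 = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
z_2 = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma

with torch.no_grad():
    # spectrograms from the pretrained Flowtron
    mel_1, _ = model.infer(z_1, speaker_vecs, text)
    mel_2, _ = model.infer(z_2, speaker_vecs, text)
    # audio from the pretrained WaveGlow
    audio_1 = waveglow.infer(mel_1, sigma=0.8)
    audio_2 = waveglow.infer(mel_2, sigma=0.8)

# audio_1 and audio_2 always sound like the same (female) speaker
```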

Thanks

rafaelvalle commented 4 years ago

You need to make sure z_1 and z_2 produce samples from different speakers. Sample z_1 once, perform inference, and memorize the speaker's voice. Keep sampling z_2, performing inference, and listening to the samples produced with z_2 until the speaker you hear is different from the one produced with z_1. Once you have z_1 and z_2 values associated with different speakers, you can interpolate between them. It is safer to set gate_threshold = 1 and prune the audio later.
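In code, the procedure might look roughly like this (a sketch with the same assumptions as above: `model`, `waveglow`, `text`, and `speaker_vecs` loaded as in inference.py; check against this repo's code whether `model.infer` takes a `gate_threshold` keyword):

```python
import torch

sigma, n_frames = 0.5, 400

def sample_z():
    # draw a residual from the Gaussian prior
    return torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma

def synthesize(z):
    # gate_threshold=1 disables early stopping; prune the silence afterwards
    with torch.no_grad():
        mel, _ = model.infer(z, speaker_vecs, text, gate_threshold=1.0)
        return waveglow.infer(mel, sigma=0.8)

z_1 = sample_z()  # listen to synthesize(z_1) once and memorize the voice
z_2 = sample_z()  # re-run this line until synthesize(z_2) sounds like a different speaker

# with z_1 and z_2 mapped to different speakers, interpolate between them
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    z = (1 - alpha) * z_1 + alpha * z_2
    audio = synthesize(z)  # the voice should morph gradually from z_1 to z_2
```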

DamienToomey commented 4 years ago

I have also set model_config['dummy_speaker_embedding'] = True.

I keep sampling z_2, performing inference, and listening to the samples produced with z_2, but the speaker's voice sounds the same as the voice produced with z_1. By the way, it is always a female voice. Do you have any idea why this might be happening?

rafaelvalle commented 4 years ago

Are you using the LibriTTS model?

DamienToomey commented 4 years ago

Yes, I am using the LibriTTS model.

rafaelvalle commented 4 years ago

Hey Damien, the pre-trained LibriTTS model available in our repo has speaker embeddings.

You need to train a model without speaker embeddings, i.e. with model_config['dummy_speaker_embedding'] = True, to be able to interpolate between speakers in the latent space.

You can warm-start from the pre-trained LibriTTS model with speaker embeddings.
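A sketch of that warm-start setup (the keys mirror config.json in this repo, but the exact warm-start key name "warmstart_checkpoint_path" is my assumption here, so verify it against your config):

```python
import json

with open("config.json") as f:
    config = json.load(f)

# disable learned speaker embeddings so speaker identity lives in z
config["model_config"]["dummy_speaker_embedding"] = True
# initialize weights from the pre-trained LibriTTS model
config["train_config"]["warmstart_checkpoint_path"] = "models/flowtron_libritts.pt"

with open("config_nospk.json", "w") as f:
    json.dump(config, f, indent=4)

# then train as usual: python train.py -c config_nospk.json
```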