CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Poor attention with a different speaker encoder #1051

Open MLrookie opened 2 years ago

MLrookie commented 2 years ago

First, thanks to CorentinJ for the excellent work! I noticed that the speaker encoder used in this repo is GE2E, whose performance falls far behind the state of the art, so I replaced the GE2E encoder with an ECAPA-TDNN model. One difference between the two is the embedding dimension: 192 for ECAPA-TDNN versus 256 for GE2E. I changed the speaker embedding size and batch size parameters in hparams.py and used synthesizer_train.py to train the Tacotron synthesizer. The parameters I used are as follows:

tts_schedule = [(2, 1e-3, 10_000, 32), (2, 5e-4, 15_000, 32), (2, 2e-4, 20_000, 32), (2, 1e-4, 30_000, 32), (2, 5e-5, 40_000, 32), (2, 1e-5, 60_000, 32), (2, 5e-6, 160_000, 32), (2, 3e-6, 320_000, 32), (2, 3e-6, 640_000, 32)]
speaker_embedding_size = 192
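For context, a minimal sketch of how a 192-dim ECAPA-TDNN utterance embedding can be extracted, assuming SpeechBrain's pretrained speechbrain/spkrec-ecapa-voxceleb model and a 16 kHz mono wav with a hypothetical file name; this is an illustration, not this repo's encoder API or the exact pipeline used here:

```python
# Sketch: extract one 192-dim ECAPA-TDNN speaker embedding with SpeechBrain.
# Assumptions: pretrained "speechbrain/spkrec-ecapa-voxceleb" checkpoint and a
# hypothetical "utterance.wav" -- illustrative only.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

wav, sr = torchaudio.load("utterance.wav")            # shape: [channels, samples]
wav = torchaudio.functional.resample(wav, sr, 16000)  # ECAPA checkpoints expect 16 kHz
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono: [1, samples]

with torch.no_grad():
    emb = encoder.encode_batch(wav)                   # shape: [1, 1, 192]

emb = torch.nn.functional.normalize(emb.squeeze(), dim=0)  # L2-normalize like a d-vector
print(emb.shape)  # torch.Size([192]) -> matches speaker_embedding_size = 192
```

Whatever extractor is used, the resulting vectors only need to match speaker_embedding_size so the synthesizer can consume them as before.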

However, after training the Tacotron for 200k steps, the loss is 0.53 but the attention plot is blank. The mel output saved every 500 steps looks similar to the ground truth. The result synthesized with the 200k checkpoint (.pt file) sounds poor, although it does resemble the target speaker. It is very strange.

Has anyone met the same problem? Or do I need to change other parameters when I change the dimension of the speaker embedding?
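As a quick diagnostic for the blank attention plot, one can dump an alignment matrix from a training or evaluation step and plot it; a healthy model shows a roughly diagonal band. A minimal sketch, where the file name and the [decoder_steps, encoder_steps] shape are assumptions for illustration:

```python
# Sketch: visualize a Tacotron attention alignment matrix.
# Assumption: "alignment_step200k.npy" is a hypothetical dump of one alignment
# of shape [decoder_steps, encoder_steps] saved during training or evaluation.
import numpy as np
import matplotlib.pyplot as plt

alignment = np.load("alignment_step200k.npy")

plt.figure(figsize=(6, 4))
plt.imshow(alignment.T, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("Decoder timestep (output mel frames)")
plt.ylabel("Encoder timestep (input text)")
plt.colorbar(label="Attention weight")
plt.tight_layout()
plt.savefig("alignment_step200k.png")
# A clear diagonal band means the decoder attends to the text in order;
# a blank or uniformly diffuse plot means alignment has not been learned.
```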

raccoonML commented 2 years ago

Which speech dataset are you using? You should be using LibriSpeech or LibriTTS if you want to compare results to the pretrained models of this repo.

MLrookie commented 2 years ago

I haven't used LibriSpeech or LibriTTS yet. The dataset I'm using is AISHELL-3, which is Mandarin. I don't think that is the reason, though, because normally an attention line should have appeared by 200k steps of training.

ZhaZhaFon commented 2 years ago

ECAPA-TDNN works well and shows superior results on speaker verification tasks, but that does not mean it is bound to beat d-vectors or other speaker embeddings (sometimes even i-vectors) when used as a speaker representation in other speaker-related tasks. At least that is what my experiments suggest...
