PaddlePaddle / PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
https://paddlespeech.readthedocs.io
Apache License 2.0

Always the same speaker output in fastspeech2 aishell3 voice conversion #1763

Closed: HighCWu closed this issue 2 years ago

HighCWu commented 2 years ago

Describe the bug: always the same speaker output in fastspeech2 aishell3 voice conversion.

To Reproduce

  1. PaddleSpeech voice cloning (语音克隆) always outputs the same speaker (see the sketch after this list for the flow being tested).
  2. When I change the synthesizer to Tacotron2, everything works fine and the model can generate speech with different speakers' timbres.
  3. Here are some outputs I packed: output_sound_fastspeech2.zip output_sound_tacotron2.zip
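
For context, this is roughly the flow being exercised. A minimal sketch, where `speaker_encoder`, `am_inference` and `voc_inference` stand in for the objects built by the aishell3 vc0/vc1 recipes (which use a GE2E speaker encoder); it is not the exact script:

import paddle
import soundfile as sf

def clone_voice(ref_wav, phone_ids, out_path, fs=24000):
    # 1. Extract a fixed-size speaker embedding from the reference audio.
    spk_emb = speaker_encoder(ref_wav)     # e.g. a 256-dim GE2E embedding
    # 2. Condition the acoustic model (Tacotron2 or FastSpeech2) on it,
    #    then vocode the predicted mel spectrogram.
    with paddle.no_grad():
        wav = voc_inference(am_inference(phone_ids, spk_emb=spk_emb))
    # 3. Write the waveform; fs should come from the acoustic model config.
    sf.write(out_path, wav.numpy(), samplerate=fs)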
zh794390558 commented 2 years ago

At present, the voice cloning quality is not good.

HighCWu commented 2 years ago

It is not a question of whether the quality is good or bad: the pretrained Tacotron2 model can clone different timbres correctly, so why do the results from the pretrained FastSpeech2 model all have the same speaker's timbre?

yt605155624 commented 2 years ago

It is not a question of whether the quality is good or bad: the pretrained Tacotron2 model can clone different timbres correctly, so why do the results from the pretrained FastSpeech2 model all have the same speaker's timbre?

Thanks for using PaddleSpeech's voice cloning! Your conclusion is very useful to me and to other users.

vc0 (Tacotron2) was trained for fewer steps than vc1 (FastSpeech2); you can also see this in the released models' names and configs. Because Tacotron2's training is unstable (see the training loss in https://github.com/PaddlePaddle/PaddleSpeech/discussions/1434), I stopped its training early. I haven't compared vc0 and vc1; maybe vc1 is overfitting. You can try training your own vc1 and stopping it early ~
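
If you retrain vc1, early stopping can be as simple as keeping the snapshot with the best dev loss and stopping once it no longer improves. A generic sketch of that selection logic (not the PaddleSpeech trainer API; `dev_losses` is a hypothetical list of per-evaluation dev losses):

def pick_early_stop(dev_losses, patience=5):
    # Return the index of the best evaluation; stop scanning once the dev
    # loss has not improved for `patience` consecutive evaluations.
    best_loss, best_idx, bad = float("inf"), 0, 0
    for i, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_loss, best_idx, bad = loss, i, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_idx   # keep the checkpoint saved at this evaluation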

yt605155624 commented 2 years ago

You can also use the new voiceprint recognition (speaker verification) model we released: https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0
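
For reference, a minimal sketch of extracting a speaker embedding with that model. It assumes a PaddleSpeech release that ships the CLI VectorExecutor; argument names and the default model may differ in your version, and the resulting embedding is not the 256-dim GE2E embedding that vc0/vc1 expect:

import paddle
from paddlespeech.cli.vector import VectorExecutor

vector_executor = VectorExecutor()
emb = vector_executor(
    model="ecapatdnn_voxceleb12",
    sample_rate=16000,
    audio_file="./ref_audio.wav",   # any 16 kHz reference utterance
    device=paddle.get_device())
print(emb)   # a fixed-size speaker embedding for the utterance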

HighCWu commented 2 years ago

@yt605155624 Thanks, I will try to train it myself. It's just that when you release pretrained models, you should test the results first: the vc0 README even recommends using vc1, yet the pretrained model provided for vc1 is effectively useless 😂. I will come back with feedback once I have trained one.

yt605155624 commented 2 years ago

Looking forward to your feedback. It would be even better if you could provide a working config file or pretrained model.

yt605155624 commented 2 years ago

FYI, here are my test results for VC: ref_audio.zip

# Snippet from the aishell3 VC synthesis setup: `phone_ids`, `am_inference`,
# `voc_inference`, `output_dir` and `am_config` are defined by that script.
import numpy as np
import paddle
import soundfile as sf

# Randomly generate values in [0, 0.2); 256 is the dim of spk_emb.
for i in range(10):
    random_spk_emb = np.random.rand(256) * 0.2
    random_spk_emb = paddle.to_tensor(random_spk_emb, dtype="float32")
    utt_id = "random_spk_emb" + "_" + str(i)
    with paddle.no_grad():
        # NOTE: this passes the fixed `spk_emb`, not `random_spk_emb`;
        # see the follow-up comment and PR #1828 below.
        wav = voc_inference(am_inference(phone_ids, spk_emb=spk_emb))
    sf.write(
        str(output_dir / (utt_id + ".wav")),
        wav.numpy(),
        samplerate=am_config.fs)
    print(f"{utt_id} done!")

vc_syn_vc0.zip vc_syn_vc1.zip

yt605155624 commented 2 years ago

Sorry, I just found out from a developer's PR that we weren't actually using the random embedding ... https://github.com/PaddlePaddle/PaddleSpeech/pull/1828/files
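
So the fix is essentially to condition on the freshly sampled embedding inside the loop of the earlier snippet (a sketch; see the linked PR for the actual change):

with paddle.no_grad():
    # Use the per-iteration random embedding instead of the fixed spk_emb.
    wav = voc_inference(am_inference(phone_ids, spk_emb=random_spk_emb))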

sixyang commented 2 years ago

vc1 does indeed sound much better than vc0, but there is still a bit of electrical/buzzing noise, and sentences have no internal pauses. How can this be improved?