auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

Has anyone reproduced the sound quality on the demo page? #20

Open WeiLi233 opened 5 years ago

WeiLi233 commented 5 years ago

Could any researchers share some advice or experience with the reproduction procedure? For me it is really difficult to reproduce the sound quality on the demo page, and I am confused about which part of the process is broken.

iyah4888 commented 4 years ago

I have spent a lot of time trying to reproduce it, and I cannot either. In my case I used the VoxCeleb2 dataset (in-the-wild data with around 6,000 speakers) rather than VCTK. The network learns auto-encoding well but does not generalize to voice conversion. Any comments?

xuexidi commented 3 years ago

I tried several times, but the voice quality got stuck at a disappointing level whether I used the VCTK dataset or my own Chinese speech dataset... so sad.

Trebolium commented 3 years ago

I was able to produce audio consisting of 'ghostly' voices after 100k iterations, though there was a lot of noise. Have either of you @WeiLi233 @xuexidi been able to train models that perform similarly to the pretrained model provided? I am concerned I have been doing something wrong, as my model produces audio with a lot of noise, very poor in comparison to the audio examples provided with the paper.

JiachuanDENG commented 3 years ago

First of all, I want to thank the author for making the code public. The code is neat and easy to read.

But sadly, the model does not perform as well as I expected after listening to the public demo.

I tried using the public code and the pre-trained model to run inference on some waveform files provided by the author in the wavs directory. Sadly, the model cannot produce satisfying results even on these samples. Although the mel-spectrogram looks good, if you listen to the output audio you cannot even understand the content. I also tried testing on unseen speakers by recording my own voice with my computer's microphone; that does not work either. The model's output sounds more like noise than a voice. I guess the model is not robust enough, since my recorded audio's spectrogram looks different from the VCTK data because a different recording device was used.

auspicious3000 commented 3 years ago

The pre-trained model is for demonstration purposes only. The model should perform well after careful re-training. As far as I know, someone has made a voice conversion phone app for Mandarin Chinese based on this model.

JohnHerry commented 3 years ago

I am using the AISHELL-3 Mandarin corpus to train the VC model. For preprocessing, the speaker embedder uses the pretrained 3000000-BL.ckpt. I ran training through main.py for 1,000,000 iterations [although there seems to be no loss decrease after 500,000 iterations], but the resulting model is not good: nothing can be heard clearly except noise. Is there any suggestion for training a good Mandarin model?

The following are the training logs and one of the final converted waveforms for testing.

[images: training logs at 500k (50W) and 1,000k (100W) iterations, and the converted test result (result_wav)]

Trebolium commented 3 years ago

Check that the tensors are the same shape before computing their loss?

xuexidi commented 3 years ago

@JohnHerry The input tensor shapes in the loss function are mismatched; I think it is a bug in the source code. Because of the mismatch, the G/loss_cd term converges to 0.0001 very quickly, which actually means the model learns nothing at all and cannot synthesize any speech. I tried modifying the loss function's input tensors so that their shapes match; when retraining, all the loss components decreased at a fairly normal rate, and the model could eventually produce a human voice. The sound quality was still not very good, though, probably because I did not retrain the WaveNet vocoder.
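
For anyone hitting the same problem, a minimal sketch of the kind of fix being described, assuming the extra singleton dimension comes from the generator outputs as in the repo's solver_encoder.py (the names G, x_real, emb_org follow that file, but verify against your own checkout):

```python
import torch.nn.functional as F

def reconstruction_losses(G, x_real, emb_org):
    """Reconstruction losses with matched tensor shapes.

    G is the AutoVC generator, x_real a (B, T, n_mels) mel batch, and
    emb_org the (B, dim_emb) speaker embeddings, as in solver_encoder.py.
    """
    x_identic, x_identic_psnt, code_real = G(x_real, emb_org, emb_org)
    # The decoder and postnet outputs carry an extra singleton dimension,
    # (B, 1, T, n_mels). Without the squeeze, F.mse_loss broadcasts
    # (B, T, n_mels) against (B, 1, T, n_mels) into (B, B, T, n_mels)
    # and averages over mismatched utterance pairs, so the loss value
    # collapses without the model learning anything useful.
    x_identic = x_identic.squeeze(1)
    x_identic_psnt = x_identic_psnt.squeeze(1)
    g_loss_id = F.mse_loss(x_real, x_identic)
    g_loss_id_psnt = F.mse_loss(x_real, x_identic_psnt)
    return g_loss_id, g_loss_id_psnt, code_real
```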

xuexidi commented 3 years ago

@Trebolium I think you are right; I realized this bug a few months ago.

JohnHerry commented 3 years ago

@xuexidi It works, thank you very much!

kingofview commented 3 years ago

@xuexidi Hello, how low were you able to get the final loss_id in training? For my model, loss_id gets stuck at around 0.001 and will not go any lower.

Trebolium commented 3 years ago

I think my loss_cd went down to 0.002/0.001, but the loss_id will not get that low because we are inferring from a bottleneck after all, meaning the reconstruction of the spectrogram will never be perfect. The question is: how good do your reconstructed mel-spectrograms sound?

JohnHerry commented 3 years ago

We tested the speaker similarity between the ground-truth audio y and the generated audio y' from this model, using Cosine(SpeakerEmbedding(y), SpeakerEmbedding(y')), where the speaker embedding model is a third-party pretrained model. The resulting values for this AutoVC model are between 0.3 and 0.6: better for seen speakers and poor for unseen ones. By contrast, the values from an ESPnet seq2seq VC model are between 0.93 and 0.95.

So the generated audio from this AutoVC model is not good enough in either naturalness or similarity. But the AutoVC model is very simple and fast, and it supports zero-shot conversion for unseen speakers.
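
For anyone who wants to run the same similarity check, here is a minimal sketch. The choice of Resemblyzer as the third-party speaker encoder and the file names are assumptions for illustration; any pretrained speaker encoder that produces utterance-level embeddings works the same way:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # assumed encoder choice

def speaker_similarity(gt_path, converted_path, encoder):
    """Cosine similarity between speaker embeddings of two utterances."""
    emb_gt = encoder.embed_utterance(preprocess_wav(gt_path))
    emb_vc = encoder.embed_utterance(preprocess_wav(converted_path))
    # Normalize so the dot product is exactly the cosine similarity
    # (Resemblyzer embeddings are already close to unit norm).
    emb_gt = emb_gt / np.linalg.norm(emb_gt)
    emb_vc = emb_vc / np.linalg.norm(emb_vc)
    return float(np.dot(emb_gt, emb_vc))

encoder = VoiceEncoder()
# Placeholder paths: ground-truth target audio vs. AutoVC output.
print(speaker_similarity("target_gt.wav", "autovc_converted.wav", encoder))
```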

dragen1860 commented 2 years ago

@auspicious3000 Could you kindly tell me which team or GitHub repo made the mobile voice conversion app? Thank you.