GANtastic3 / MaskCycleGAN-VC

Implementation of Kaneko et al.'s MaskCycleGAN-VC model for non-parallel voice conversion.
MIT License

Much worse results than in paper #3

Closed terbed closed 3 years ago

terbed commented 3 years ago

Hi,

First of all, thank you for this nice implementation.

I trained the network with the default settings and data (~500k iterations), but the results are really unnatural (e.g.: link) and far from the samples provided by the paper's author: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html

Why is this? Did you experience the same, or did you get nice results?

hikaruhotta commented 3 years ago

Hi @terbed, thanks for creating this issue. We found that the quality of conversion depended heavily on the pair of speakers selected. It worked better between speakers of the same gender. However, we saw large improvements over CycleGAN-VC3 and CycleGAN-VC2.

todalex commented 3 years ago

It looks like training gets stuck in a local optimum, after which the generator loss stops improving. Any suggestions for escaping the local optimum?

terbed commented 3 years ago

@todalex Maybe cyclical learning rate schedulers? https://pytorch.org/docs/master/generated/torch.optim.lr_scheduler.CyclicLR.html
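As a rough illustration, the triangular policy that PyTorch's `CyclicLR` implements can be sketched in plain Python; `base_lr`, `max_lr`, and `step_size` below are illustrative values, not this repo's defaults:

```python
import math

def triangular_lr(iteration, base_lr=1e-5, max_lr=5e-4, step_size=2000):
    """Triangular cyclical learning rate (the default CyclicLR policy):
    the LR ramps linearly from base_lr up to max_lr over step_size
    iterations, then back down, repeating every 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

print(triangular_lr(0))     # start of cycle: base_lr
print(triangular_lr(2000))  # mid-cycle peak: max_lr
print(triangular_lr(4000))  # end of cycle: back to base_lr
```

The periodic return to `max_lr` is what gives the optimizer a chance to jump out of a sharp local optimum.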

hikaruhotta commented 3 years ago

Thank you @todalex and @terbed for your suggestions. May I ask which speakers both of you are converting between?

I trained a model for conversion between VCC2SF3 and VCC2TF1, and the results were generally comparable to the paper's results after 3500 epochs (~260,000 iterations). They can be found here: https://github.com/GANtastic3/MaskCycleGAN-VC/tree/main/audio_samples/VCC2SF3_VCC2TF1

The original paper's authors also convert between these speakers http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html

It is possible that conversion works better between some speaker pairs than others; this would be very interesting to investigate.

hikaruhotta commented 3 years ago

@todalex Since the generator's loss is adversarial, you would expect the loss curve to plateau while the generator and discriminator continue to compete against each other to improve the conversion. Unfortunately, generator-loss convergence is not a good metric for deciding when training is done. Could you try training for longer and sharing your results?
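A toy way to see why a flat loss curve can coexist with improving conversions: near equilibrium the discriminator outputs roughly 0.5 for everything, so the standard non-saturating generator loss -log D(G(z)) sits at log 2 no matter how good the samples have become. This is an idealized sketch, not this repo's actual (LSGAN-style) losses:

```python
import math

def nonsaturating_g_loss(d_out):
    # Standard non-saturating generator objective: -log D(G(z)).
    return -math.log(d_out)

# Near equilibrium the discriminator cannot tell real from fake and
# returns ~0.5 for every sample, regardless of sample quality, so the
# generator loss is pinned at log 2 ~ 0.693 from epoch 700 to 2400:
for epoch in (700, 2400):
    print(epoch, round(nonsaturating_g_loss(0.5), 3))
```

The loss value reflects the *balance* between the two networks, not the absolute quality of the audio.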

todalex commented 3 years ago

@HikaruHotta Dear Hikaru, I am using my own dataset of Persian speakers. Since epoch 700 my g_loss has been 7.5; I continued training until epoch 2400 and sadly the generator loss did not change.

hikaruhotta commented 3 years ago

@todalex The generator can continue to improve even if the generator loss stays at 7.5 or increases. Could you attach audio samples at epochs 700 and 2400 of the ground truth and the converted audio? Could you also attach the training curves for both generators and discriminators?

pavelxx1 commented 3 years ago

Hi to all, I am also training on my own dataset, and the result is https://prnt.sc/126uirv

For good results, is it necessary that the generator loss be less than 1?

terbed commented 3 years ago

> Hi,
>
> First of all, thank you for this nice implementation.
>
> I trained the network with the default settings and data (~500k iterations), but the results are really unnatural (e.g.: link) and far from the samples provided by the paper's author: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html
>
> Why is this? Did you experience the same, or did you get nice results?

In my case, the performance of the neural vocoder (MelGAN) limited the synthesized voice. In the paper, they might have used different MelGAN weights that better capture the speaker's voice.

pavelxx1 commented 3 years ago

> Hi, First of all, thank you for this nice implementation. I trained the network with the default settings and data (~500k iterations), but the results are really unnatural (e.g.: link) and far from the samples provided by the paper's author: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html Why is this? Did you experience the same, or did you get nice results?
>
> In my case, the performance of the neural vocoder (MelGAN) limited the synthesized voice. In the paper, they might have used different MelGAN weights that better capture the speaker's voice.

You can use the HiFi-GAN vocoder; it's better than MelGAN.

pavelxx1 commented 3 years ago

> @pavelxx1 https://prnt.sc/126uirv seems to be a dead link

The link is OK; check your internet connection.

P.S.: From epoch 3000 to 5000 I have the same plateau, g_loss 8.0-8.5 :(

and the results are worse at inference (testing).

terbed commented 3 years ago

What is your problem with this? I think it is OK. From iteration 10k there is a drop, because the identity loss term is eliminated (and the LR scheduler also starts).
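That drop can be sketched as a simple schedule on the identity-mapping loss weight; the λ_id = 5 value and the 10k-iteration cutoff below follow common CycleGAN-VC practice and are assumptions here, so check the repo's training arguments for the actual values:

```python
def identity_loss_weight(iteration, lambda_id=5.0, stop_after=10_000):
    """The identity-mapping loss is only applied for the first `stop_after`
    iterations; zeroing its weight afterwards produces a visible drop in
    the total generator loss that does not mean worse conversions."""
    return lambda_id if iteration < stop_after else 0.0

print(identity_loss_weight(9_999))   # 5.0 (identity term still active)
print(identity_loss_weight(10_000))  # 0.0 (identity term switched off)
```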

todalex commented 3 years ago

@HikaruHotta Dear Hikaru, I changed my data to the same gender, as you said it would give better results, but I have the same problem and g_loss is not getting better. Here is my TensorBoard: https://tensorboard.dev/experiment/3QJKZ0ZjQQa0dw5TMoLRbw/ and here is the converted output of epoch 2000: https://drive.google.com/drive/folders/1-8xE4AvkjSMr2h3_lPrNToHmtVyiCluN?usp=sharing

hikaruhotta commented 3 years ago

@todalex It's great that you're experimenting with new datasets to determine the robustness of MaskCycleGAN-VC.

> @todalex The generator can continue to improve even if the generator loss stays at 7.5 or increases. Could you attach audio samples at epochs 700 and 2400 of the ground truth and the converted audio? Could you also attach the training curves for both generators and discriminators?

As mentioned above, g_loss does not always decrease monotonically when training GANs, because two models are competing against each other. The optimization problem changes every time the generator or discriminator is updated. Sometimes g_loss goes up while the generator still improves.

Here is a Stack Overflow thread on how to interpret GAN losses: https://stackoverflow.com/questions/42690721/how-to-interpret-the-discriminators-loss-and-the-generators-loss-in-generative

I would encourage you to train your model for longer, since it is probably still learning. Your model is definitely not suffering from mode collapse. Do you have audio samples from the real dataset? They could help diagnose what is going wrong.

todalex commented 3 years ago

@HikaruHotta Thank you for the link about GAN losses; it was very helpful. This is my dataset of the same gender (male), split into two parts, train/s1, eval/s1 and train/s2, eval/s2: 4/1AY0e-g6GjgqzHZ_j0mMsKusCQ5fcite0EbMNlN26v3D1gLnMx2TVnTOOczs

As you said, I will continue training, because d_loss is descending, but I think the model has learned well enough and reached some optimum, yet the output is not good. How can I change the learning rate and find a good learning rate for my data, so that I can try training again with the new learning rate and see if the result is any better?

pavelxx1 commented 3 years ago

@HikaruHotta, if I want to use the MelGAN vocoder for a non-English dataset, must I also train the MelGAN vocoder from scratch on my dataset? Thanks.

hikaruhotta commented 3 years ago

> @HikaruHotta Thank you for the link about GAN losses; it was very helpful. This is my dataset of the same gender (male), split into two parts, train/s1, eval/s1 and train/s2, eval/s2: 4/1AY0e-g6GjgqzHZ_j0mMsKusCQ5fcite0EbMNlN26v3D1gLnMx2TVnTOOczs
>
> As you said, I will continue training, because d_loss is descending, but I think the model has learned well enough and reached some optimum, yet the output is not good. How can I change the learning rate and find a good learning rate for my data, so that I can try training again with the new learning rate and see if the result is any better?

You can modify the --lr (learning rate) argument as shown here: https://github.com/GANtastic3/MaskCycleGAN-VC#training
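For reference, a minimal sketch of how such a flag is typically wired up with argparse; only the `--lr` flag itself is confirmed by the README, and the 5e-4 default is the value quoted in this thread, so treat the rest as illustrative:

```python
import argparse

# Sketch of a training entry point exposing a learning-rate flag.
parser = argparse.ArgumentParser(description="MaskCycleGAN-VC training (sketch)")
parser.add_argument("--lr", type=float, default=5e-4, help="learning rate")

# Equivalent to running: python -m <train_script> --lr 1e-4
args = parser.parse_args(["--lr", "1e-4"])
print(args.lr)  # 0.0001
```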

pavelxx1 commented 3 years ago

> @HikaruHotta, if I want to use the MelGAN vocoder for a non-English dataset, must I also train the MelGAN vocoder from scratch on my dataset? Thanks.
>
> @pavelxx1 Seeing that the MelGAN vocoder was trained on https://keithito.com/LJ-Speech-Dataset/, I would expect that it does not model the complexities of other languages. I suggest that you take a look at Universal MelGAN (https://kallavinka8045.github.io/icassp2021/), which seems to work across multiple languages.

Thanks, but if I want to use MelGAN for my language (UA), must I also train a vocoder?

terbed commented 3 years ago

No, the vocoder is language-independent; I am using it for Hungarian and it works quite well.

terbed commented 3 years ago

My performance issue on a male speaker is solved by #7; the bad performance can be attributed to the spectrogram being scaled with the female speaker's statistics.
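The fix amounts to normalizing each speaker's mel-spectrogram with that speaker's own statistics rather than another speaker's. A pure-Python sketch (the function and variable names are illustrative, not the repo's):

```python
from statistics import mean, stdev

def normalize(values, speaker_mean, speaker_std, eps=1e-9):
    # Scale mel values with the statistics of the SAME speaker; using
    # another speaker's (e.g. female) mean/std shifts and stretches the
    # spectrogram and degrades conversion quality.
    return [(x - speaker_mean) / (speaker_std + eps) for x in values]

male_values = [4.0, 6.0, 8.0, 10.0]
m, s = mean(male_values), stdev(male_values)
normed = normalize(male_values, m, s)
print(round(mean(normed), 6))  # ~0.0 once per-speaker statistics are used
```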

todalex commented 3 years ago

@HikaruHotta What I meant is: please help me find a good learning rate. I don't know what value I should change it to; the current value is 5e-4. Can I change the lr at epoch 3000, for example, to escape the local optimum?