bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

loss issues encountered in fine-tuning the model #26

Closed wangtao201919 closed 1 year ago

wangtao201919 commented 1 year ago

Hello author, this project is great. I am trying to add some Chinese speech for fine-tuning, but my `validation/mel_spec_error` has almost stopped decreasing at 15k steps, and `training/gen_loss_total` has also increased. I would like to ask whether this loss behaviour is normal. Thank you so much.

RF5 commented 1 year ago

Hi @wangtao201919

The training loss curves do look a little strange: the generator loss probably should not be increasing that much. But it is hard to say for certain, since the training loss dynamics of GANs can be tricky to interpret. The validation mel-spectrogram error should also ideally be decreasing, but again it is hard to tell from the curves alone. My advice is to listen to the samples generated every 20k steps or so and assess whether they sound better.
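
For reference, the `validation/mel_spec_error` being discussed is an L1 distance between mel-spectrograms of the generated and ground-truth audio. Here is a minimal sketch of that kind of metric, assuming torchaudio and HiFi-GAN-style mel settings; the exact parameters used in this repo's training code may differ:

```python
import torch
import torchaudio

# Assumed HiFi-GAN-style mel settings; check the repo's config for the real values.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, win_length=1024,
    hop_length=256, n_mels=80,
)

def mel_spec_error(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between log-mel spectrograms of two equal-length waveforms."""
    log_mel_gen = torch.log(mel(generated).clamp(min=1e-5))
    log_mel_tgt = torch.log(mel(target).clamp(min=1e-5))
    return torch.nn.functional.l1_loss(log_mel_gen, log_mel_tgt)
```

If a metric like this plateaus while the samples keep improving perceptually, trusting your ears, as suggested above, is usually the safer call.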

To give some comparison, here is what the training curves look like for our prematched vocoder trained on LibriSpeech for the first 1M steps:

[image: training loss curves for the prematched vocoder, first 1M steps]

Aside from the generator loss, our plots and yours look fairly similar.

I hope that helps a bit!

wangtao201919 commented 1 year ago

Thank you very much for your response and suggestions. I have tested the fine-tuned model, and the improvement is significant. However, I have found that generalization in cross-lingual conversion is not stable enough, just as you mention in your paper: "how far away can the reference utterances be from the training distribution?" I suspect it might be because the distance between features of the source and reference utterances cannot be measured as accurately across languages as within the same language.

RF5 commented 1 year ago

Ahh I see. Yeah, for languages that are very different from English, one might need to fine-tune the WavLM encoder on that language as well, so that it represents the language better in the feature space. Without this, you are probably right that distance comparisons between features of different languages are not as reliable as within the same language.
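
For readers following along, the conversion step in question is a simple k-nearest-neighbours regression over WavLM features, so the quality of those cross-lingual distances directly determines which reference frames get selected. Here is a minimal sketch of the matching step following the paper's description; the repo's own `match` function is the reference implementation:

```python
import torch
import torch.nn.functional as F

def knn_match(query: torch.Tensor, matching_set: torch.Tensor, topk: int = 4) -> torch.Tensor:
    """Replace each source frame with the mean of its k nearest reference frames.

    query:        (T_q, D) WavLM features of the source utterance
    matching_set: (T_m, D) WavLM features pooled from the reference utterances
    """
    sims = F.normalize(query, dim=-1) @ F.normalize(matching_set, dim=-1).T  # cosine similarity
    idx = sims.topk(topk, dim=-1).indices   # k closest reference frames per source frame
    return matching_set[idx].mean(dim=1)    # (T_q, D) converted features for the vocoder
```

If the source language is far from the training distribution, the selected neighbours can be phonetically wrong even when the cosine scores look reasonable, which matches the instability described above.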

egorsmkv commented 1 year ago

@RF5 is there any tutorial to fine-tune the WavLM encoder?

RF5 commented 1 year ago

Hi @egorsmkv, unfortunately not that we know of. There might be some useful resources in the original Microsoft repo, but I don't think they ever open-sourced their training code.
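
In the meantime, one low-effort sanity check is to measure how well the frozen WavLM features already cover the new language, using the hub entry point shown in the repo README. This is only a sketch: the file names are placeholders, and the flags and device argument should be checked against the README:

```python
import torch
import torch.nn.functional as F

# Hub entry point as shown in the repo README; flags and device are assumptions.
knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True,
                        trusted_repo=True, pretrained=True, device='cuda')

query = knn_vc.get_features('mandarin_source.wav')            # placeholder path
pool = knn_vc.get_matching_set(['mandarin_ref_01.wav',        # placeholder paths
                                'mandarin_ref_02.wav'])

# Mean cosine similarity of each source frame to its single nearest reference frame.
# Noticeably lower values than for an English-to-English pair would suggest the
# frozen encoder represents the language poorly and may need fine-tuning.
q = F.normalize(query, dim=-1)
m = F.normalize(pool, dim=-1)
nearest = (q @ m.T).max(dim=-1).values
print(f'mean nearest-neighbour similarity: {nearest.mean().item():.3f}')
```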

EmreOzkose commented 9 months ago

Hi @wangtao201919, did you fine-tune WavLM for Chinese? If not, did you get good results by fine-tuning only the vocoder?