OlaWod / FreeVC

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
MIT License

Fine-tuning with custom (multilingual) data #82

Open ukemamaster opened 8 months ago

ukemamaster commented 8 months ago

Hi @OlaWod, I appreciate your work.

I am trying to fine-tune the FreeVC model on my custom multilingual data (using an already-trained speaker encoder model) and without SR augmentation. After about 300k steps with batch size 32, it produces fair conversion outputs. Roughly, I warm-start from the released checkpoint and then continue training on my own file lists, as in the sketch below.
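(For reference, a minimal sketch of what I mean by warm-starting, following the model construction and checkpoint loading in the repo's convert.py; the checkpoint path is a placeholder:)

```python
# Minimal warm-start sketch, mirroring the loading code in convert.py.
import utils
from models import SynthesizerTrn

hps = utils.get_hparams_from_file("configs/freevc.json")
net_g = SynthesizerTrn(
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).cuda()
# Load the released weights ("checkpoints/freevc.pth" is a placeholder path),
# then continue training with train.py on the custom file lists.
utils.load_checkpoint("checkpoints/freevc.pth", net_g, None)
```

However, I have some questions: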

  1. Unseen-to-seen and unseen-to-unseen conversions seem to have poor quality. Will adding more data to the training set improve these cases?
  2. Is it necessary to train the WavLM and HiFi-GAN models on the custom dataset, or are the pre-trained models OK to use as-is?
  3. Is it possible to train the FreeVC model by feeding mel-spectrograms directly to the bottleneck extractor instead of SSL features, i.e., skipping the HiFi-GAN and WavLM models (see the first sketch after this list)? Have you tried it? Is it worth a try?
  4. Does the 24 kHz training recipe perform better than the 16 kHz one?
  5. Does the SR augmentation have a big effect on performance?
  6. Can the conversion process run in real time? That is, can we convert source audio frame by frame rather than as a whole (see the second sketch after this list)?
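To make question 3 concrete, here is a rough sketch of the input swap I have in mind. `mel_spectrogram_torch` and the 16 kHz STFT settings come from the repo's `mel_processing.py` and `configs/freevc.json`; the point is that the bottleneck extractor's input dimension (`ssl_dim`, 1024 for WavLM features) would have to change to 80:

```python
# Sketch for question 3: 80-bin mels in place of 1024-dim WavLM features.
# STFT settings mirror configs/freevc.json (hop 320 matches WavLM's stride).
import torch
from mel_processing import mel_spectrogram_torch

wav = torch.randn(1, 16000).clamp(-1, 1)  # placeholder: 1 s of 16 kHz audio
mel = mel_spectrogram_torch(wav, 1280, 80, 16000, 320, 1280, 0, None)
print(mel.shape)  # (1, 80, frames); the encoder would take 80 channels, not 1024
```

And for question 6, the kind of chunked inference I am imagining, as a rough sketch only: `convert_chunk` below is a hypothetical stand-in for a full FreeVC forward pass (WavLM features -> bottleneck -> decoder conditioned on the target speaker embedding), and since WavLM features depend on surrounding context, chunked output will not exactly match a whole-utterance pass:

```python
# Hypothetical chunked inference for question 6: convert overlapping windows
# and crossfade the overlaps instead of converting the whole utterance at once.
import numpy as np

def convert_streaming(wav, convert_chunk, sr=16000, win_s=1.0, hop_s=0.5):
    win, hop = int(win_s * sr), int(hop_s * sr)
    overlap = win - hop
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = np.zeros(len(wav))
    for start in range(0, len(wav) - win + 1, hop):
        # Hypothetical model call; must return a float array of length `win`.
        chunk = convert_chunk(wav[start:start + win]).copy()
        if start > 0:
            chunk[:overlap] *= fade_in                   # fade in the new chunk
            out[start:start + overlap] *= 1.0 - fade_in  # fade out previous tail
        out[start:start + win] += chunk
    return out  # any tail shorter than one window is left unconverted here
```

A real low-latency version would also have to carry WavLM's receptive-field context across chunk boundaries, so this only hints at the latency/quality trade-off I am asking about.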

Any other tips that could improve the conversion quality are appreciated.

Thanks

Xmiler commented 8 months ago

Hi @ukemamaster,

I am new here and will follow this topic with interest. Could you please share some audio samples your model generates, so we can see the quality you have achieved?

Thanks in advance.