Hi @OlaWod, I appreciate your work.

I am trying to fine-tune the FreeVC model on my custom multilingual data (using an already trained speaker encoder model), without SR augmentation. After some 300k steps (batch size 32) it gives fair conversion outputs. However, I have some questions:
It seems that unseen-to-seen and unseen-to-unseen conversions have poor quality. Will adding more data to the training set improve these cases?
Is it necessary to train the WavLM and HiFiGAN models on the custom dataset, or are the pre-trained models OK to use with custom data?
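For context, this is roughly what I mean by keeping the pre-trained models: the content encoder stays a frozen WavLM used only for feature extraction. A minimal sketch, assuming the HuggingFace checkpoint (FreeVC ships its own WavLM-Large loading code, so the exact path and API here are assumptions):

```python
# Sketch: extracting content features with a frozen, pre-trained WavLM.
# Checkpoint name is an assumption; FreeVC loads its own WavLM-Large.
import torch
from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
wavlm.eval()  # kept frozen; no gradient updates during fine-tuning

wav = torch.randn(1, 16000)  # placeholder: 1 s of 16 kHz audio
with torch.no_grad():
    ssl_features = wavlm(wav).last_hidden_state  # (1, frames, 1024)
```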
Is it possible to train the FreeVC model with mel-spectrograms fed directly to the bottleneck extractor instead of SSL features (i.e., skipping the WavLM and HiFiGAN models)? Have you tried it? Is it worth a try?
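To make the question concrete, here is a sketch of the input swap I have in mind. All parameters below are illustrative assumptions, not FreeVC's actual config:

```python
# Sketch: mel-spectrogram frames as the bottleneck extractor's input,
# replacing 1024-dim WavLM features. Parameters are assumptions.
import torch
import torchaudio

mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1280, hop_length=320, n_mels=80
)
wav = torch.randn(1, 16000)  # placeholder waveform
mel = mel_fn(wav)            # (1, 80, frames)
# The bottleneck extractor would then consume 80-dim frames instead of
# 1024-dim SSL features, so its input projection would need changing.
```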
Does the 24 kHz training recipe perform better than the 16 kHz one?
Does the SR augmentation have a big effect on performance?
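As I understand the paper, SR augmentation stretches or squeezes the mel-spectrogram along the frequency axis to perturb speaker timbre, then vocodes the result back to a waveform with HiFiGAN. A rough sketch of the resize step only; the ratio handling and padding are my assumptions, not the actual recipe:

```python
# Sketch of the spectrogram-resize (SR) idea: rescale the frequency
# axis, then crop/pad back to the original number of mel bins.
import torch
import torch.nn.functional as F

def resize_mel(mel: torch.Tensor, ratio: float) -> torch.Tensor:
    """mel: (1, n_mels, frames) -> same shape, frequency axis resized."""
    n_mels = mel.size(1)
    stretched = F.interpolate(
        mel.unsqueeze(1),            # (1, 1, n_mels, frames)
        scale_factor=(ratio, 1.0),   # rescale frequency axis only
        mode="bilinear",
        align_corners=False,
    ).squeeze(1)
    if stretched.size(1) >= n_mels:  # squeezed taller: crop back down
        return stretched[:, :n_mels, :]
    pad = n_mels - stretched.size(1)
    return F.pad(stretched, (0, 0, 0, pad))  # stretched shorter: pad up
```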
Can the conversion process run in real time? That is, can we convert the source audio frame by frame rather than as a whole?
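What I am imagining is a chunked loop like the sketch below. `convert_chunk` is a hypothetical stand-in for the model's inference call; the released FreeVC converts whole utterances, and the context WavLM and the decoder need makes true frame-by-frame streaming non-trivial, so chunk sizes here are assumptions:

```python
# Sketch: chunked pseudo-streaming conversion with overlap context.
# convert_chunk is hypothetical and assumed length-preserving.
import torch

CHUNK = 8000    # 0.5 s at 16 kHz; assumption
OVERLAP = 1600  # 0.1 s of left context per chunk; assumption

def stream_convert(wav: torch.Tensor, convert_chunk) -> torch.Tensor:
    out, start = [], 0
    while start < wav.size(-1):
        end = min(start + CHUNK, wav.size(-1))
        ctx_start = max(0, start - OVERLAP)
        converted = convert_chunk(wav[..., ctx_start:end])
        out.append(converted[..., start - ctx_start:])  # drop context
        start = end
    return torch.cat(out, dim=-1)
```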
Any other tips that could improve the conversion quality are appreciated. Thanks!
I am new here and will follow this topic with interest. Could you please share some audio samples your model generates, so we can see the quality you have achieved?