OlaWod / FreeVC

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
MIT License

Poor results with: voice_conversion_models--multilingual--vctk--freevc24.zip CoquiTTS #89

Open ballerburg9005 opened 7 months ago

ballerburg9005 commented 7 months ago

At first I was somewhat impressed. I used a male voice as the source and a female voice as the target; the target clip was 30 seconds long and noise-cleaned by AI. The result made the source wav sound like a prepubescent boy, similar to the target wav speaker, just not as feminine as the target wav.

I am quite familiar with how to create clean 30-second voice samples, and with them voice transfer works very well on the commercial Coqui TTS website, which I assume uses a variation of FreeVC.

After that I tried many different male celebrity voices (with similar or lower pitch), and the output always sounded like the same voice: some dude (whose voice wasn't all that manly) who was in neither of the provided sample wavs. There was little if any style transfer for the basic parameters of a voice, such as tone/undertone, pitch, and rasp, i.e. what makes it most recognizable. In that respect it sounded 90% like this same dude every time (presumably a voice used to train some TTS that the model takes as its basis), 9% like the source wav, and 1% like the target wav. It kept all the nuances from the source wav and would sometimes transfer nuances from the target wav, but only nuances. So if you put in Duke Nukem + Duke Nukem, you always get "this dude", who now speaks with the boastful, caricatured intonation of the source wav (not the target wav), but whose basic voice sounds nothing like Duke Nukem. Sometimes I could recognize the original source speaker's voice more strongly than at other times, and sometimes the intonation was poor or there were artifacts.

I also noticed that input at a 48000 Hz sample rate comes out somewhat cleaner than at 44100 Hz, but nothing else I tried, such as converting to mono 16 kHz, changed the overall result.
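For anyone comparing inputs the same way, it may help to confirm what format each clip really has before blaming the model. The following is a minimal sketch using only Python's standard `wave` module; the helper name `describe_wav` is hypothetical, and it assumes uncompressed PCM WAV files.

```python
import wave


def describe_wav(path_or_file):
    """Return basic format facts for a PCM WAV file.

    Accepts a filesystem path or a binary file-like object.
    """
    with wave.open(path_or_file, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            # Duration is the frame count divided by frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }


# Example usage (file names taken from the command in this post):
#   print(describe_wav("untitled.wav"))
#   print(describe_wav("input.wav"))
```

Running this on both the source and the target clip makes it easy to spot a mismatch (for example a stereo 44100 Hz source against a mono 16000 Hz target) when comparing results.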

tts --model_name "voice_conversion_models/multilingual/vctk/freevc24" --source_wav untitled.wav --target_wav=input.wav --out_path=out.wav
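To try one source clip against several target speakers, the command above could be generated in a loop. This is only a sketch: it uses just the flags shown in the command, assumes the `tts` executable is on the PATH, and the helper names and output naming scheme are hypothetical.

```python
import shlex
from pathlib import Path

MODEL = "voice_conversion_models/multilingual/vctk/freevc24"


def build_vc_command(source_wav, target_wav, out_path, model=MODEL):
    """Build the argument list for one voice-conversion run."""
    return [
        "tts",
        "--model_name", model,
        "--source_wav", str(source_wav),
        "--target_wav", str(target_wav),
        "--out_path", str(out_path),
    ]


def commands_for_targets(source_wav, target_wavs, out_dir="."):
    """Build one command per target clip, with a distinct output file each."""
    commands = []
    for target in target_wavs:
        # Hypothetical naming scheme: <source stem>_to_<target stem>.wav
        out_name = f"{Path(source_wav).stem}_to_{Path(target).stem}.wav"
        commands.append(
            build_vc_command(source_wav, target, Path(out_dir) / out_name)
        )
    return commands


# Example usage: print the commands, or pass each list to subprocess.run().
#   for cmd in commands_for_targets("untitled.wav", ["input.wav"]):
#       print(shlex.join(cmd))
```

Building the commands as argument lists rather than one shell string avoids quoting problems with file names that contain spaces.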

Is this a bug, or is this expected behavior?

The Coqui version of freevc24 is also substantially larger (1.6GB) and dates from around March 2023.

etlweather commented 2 weeks ago

Observed something quite similar. Were you able to improve the voice transfer?

ballerburg9005 commented 2 weeks ago

No, I don't think you can do much of anything about it.