Poor results with: voice_conversion_models--multilingual--vctk--freevc24.zip CoquiTTS

ballerburg9005 commented 10 months ago

At first I was somewhat impressed. I used a male voice as the source and a female voice as the target; the target sample was 30 seconds long and noise-cleaned by AI. The output pretty much made the source wav sound like a prepubescent boy: similar to the target wav speaker, just not as feminine.

I am quite familiar with how to create clean 30-second voice samples, and with such samples voice transfer works very well on the commercial Coqui TTS website, which I assume uses a variation of FreeVC.

But after this I tried many different male celebrity voices and the like (with similar or low-pitched voices), and the output always sounded like the same voice of some dude (whose voice wasn't all that manly) who was in neither of the provided sample wavs. There was little if any style transfer of the basic, fundamental parameters of the voice, such as tone/undertone, pitch and rasp, i.e. what makes it most recognizable. In that respect it sounded 90% like this same dude every time (presumably a voice used to train some TTS that the model takes as its basis), 9% like the source wav and 1% like the target wav. It kept all the nuances from the source wav and would sometimes transfer nuances from the target wav, but only nuances. So you put in Duke Nukem + Duke Nukem and you always get "this dude", who now speaks with the boastful, caricatured intonation of the source wav (not the target wav), but whose basic voice sounds nothing like Duke Nukem. Sometimes I could recognize the original source wav's speaker more strongly than other times, and sometimes the intonation was poor or there were artifacts.

I also noticed that a sample rate of 48000 Hz works somewhat cleaner than 44100 Hz, but nothing else I tried, such as converting to mono 16 kHz, changed the result.
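Since the input format seems to make some difference, it can help to confirm what each wav file actually contains before comparing runs. The following is a minimal sketch using only Python's standard `wave` module; the helper name `describe_wav` is made up for illustration, and it only handles plain PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return basic format facts for a PCM WAV file.

    Hypothetical helper for illustration. The stdlib `wave` module
    reads PCM data only; other encodings raise wave.Error.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,                     # e.g. 16000, 44100, 48000
            "channels": wav.getnchannels(),             # 1 = mono, 2 = stereo
            "bits_per_sample": wav.getsampwidth() * 8,  # sample width is in bytes
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    import sys

    # Print the format of every file given on the command line.
    for wav_path in sys.argv[1:]:
        print(wav_path, describe_wav(wav_path))
```

Running it on both the source and the target sample shows whether they share a sample rate and channel count, which makes comparisons like 48000 Hz versus 44100 Hz easier to keep track of.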

tts --model_name "voice_conversion_models/multilingual/vctk/freevc24" --source_wav untitled.wav --target_wav=input.wav --out_path=out.wav
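To compare many target voices against one source, the command above can be wrapped in a small loop. This is only a sketch: it assumes the `tts` executable is on the PATH, it uses just the flags shown in the command above, and the function names and output naming scheme are hypothetical.

```python
import subprocess
from pathlib import Path

# Model name copied from the command in this issue.
MODEL = "voice_conversion_models/multilingual/vctk/freevc24"


def build_command(source_wav, target_wav, out_path):
    """Build the argument list for one voice conversion run."""
    return [
        "tts",
        "--model_name", MODEL,
        "--source_wav", str(source_wav),
        "--target_wav", str(target_wav),
        "--out_path", str(out_path),
    ]


def convert_all(source_wav, target_wavs, out_dir, run=subprocess.run):
    """Convert one source wav towards each target wav in turn.

    `run` is injectable so the loop can be dry-run without the CLI
    installed. Output names ("<source>_as_<target>.wav") are an
    arbitrary choice made for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for target in map(Path, target_wavs):
        out_path = out_dir / f"{Path(source_wav).stem}_as_{target.stem}.wav"
        # check=True raises CalledProcessError if the CLI exits non-zero.
        run(build_command(source_wav, target, out_path), check=True)
        outputs.append(out_path)
    return outputs
```

Listening to the resulting files side by side makes it easier to judge whether the output changes with the target sample at all, or whether it collapses to the same voice every time.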

Is this a bug, or is this not unusual?

The Coqui version of freevc24 is also substantially larger (1.6GB) and from March 2023 or so.

etlweather commented 3 months ago

Observed something quite similar. Were you able to improve the voice transfer?

ballerburg9005 commented 3 months ago

No, I don't think you can do much of anything about it.

gukush commented 1 month ago

I encountered the same issue. Have you found any similar models that do work?

ballerburg9005 commented 4 weeks ago

I believe few open source models exist that can do one-shot voice-to-voice transfer. One has to be careful when searching: "speech-to-speech" nowadays means various other things when connected to LLMs that are not related to voice-to-voice, yet some people also call voice-to-voice "speech-to-speech", which is annoying.

I think most of the effort has gone into TTS models. For TTS I can recommend openedai-speech + xtts v2, which just does a very good job. There is also F5-TTS, which is sort of experimental. What's special about it is that it transfers the emotions from the voice sample very accurately. XTTS will add a little emotion here and there if that makes sense in the context of the input text, but it always stays fairly neutral. F5-TTS, on the other hand, will yell or cry 100% of the time if that's part of the input sample, in almost exactly the same way. Last I checked F5-TTS was sort of clunky though, and it sometimes mispronounces things. That is rare, but XTTS works near flawlessly, so F5-TTS can feel like a downgrade.

This is all TTS though, not voice-to-voice.

gukush commented 4 weeks ago

Thanks. It's unfortunate that zero-shot open source voice cloning seems to be rather rare.

OlaWod / FreeVC

Poor results with: voice_conversion_models--multilingual--vctk--freevc24.zip CoquiTTS #89