coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0
31.64k stars, 3.78k forks

[Feature request] pronunciation, cadence and nuances in XTTS v2... #3764

Open 0wwafa opened 1 month ago

0wwafa commented 1 month ago

Hello! I have used XTTS v2 for a while and made great voices. I wish to know one thing: every voice I make has the same cadence and pronunciation when it "speaks" (clearly coming from the trained model). How can I get those from the speaker as well? I mean, to really clone a voice you need not only the frequencies but also the speaker's nuances. Can you please post an example or, even better, add the feature directly to XTTS v2, so that one can choose between a standard voice, a "speaker" voice, or a speaker voice plus "nuance"? That would be great! Thanks.

0wwafa commented 2 weeks ago

How can I do this manually? Can anybody help?

Aphexus commented 1 week ago

I don't think this is entirely true. I put in some text that included something like "in the butt, yeah in the butt!", and for the last part it raised the pitch of "yeah" and said it with more excitement, so it felt like an exclamation (as if it took the "!" into account).

So there are some nuances. Maybe there should be a way to modify the speech a bit with "special tokens" that raise or lower the pitch, increase the speed, and so on. For that to work, I think someone would have to label a training set that way, otherwise it likely won't feel natural.

0wwafa commented 1 week ago

@Aphexus lol, yes, there are some, but they are not the same as the speaker's.

like:

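from TTS.api import TTS
# Load the multilingual XTTS v2 model on the CPU
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
# Read the input text and strip the line breaks
with open("text.txt", "r") as f:
    t = f.read().replace("\n", "")
# Clone the voice from two reference clips and write the result to test.wav
tts.tts_to_file(text=t, speaker_wav=["./speaker1.wav", "./speaker2.wav"], language="en", file_path="test.wav")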

No matter how long the samples are or how many I use, the intonation does not match the original, even if the voice is similar.
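
For anyone who wants to experiment by hand, here is a minimal sketch of the lower-level route, which computes the conditioning latents from the reference clips explicitly and then calls inference with them. The helper name `synthesize_with_reference` is hypothetical. It assumes `model` is an already loaded `Xtts` instance (from `TTS.tts.models.xtts`), and that `get_conditioning_latents` accepts `gpt_cond_len` and `inference` accepts `temperature` and `speed`, as I understand the current signatures. Nothing here is confirmed to reproduce a speaker's cadence; it only exposes the knobs that seem most relevant to try.

```python
from typing import Any, Dict, Sequence


def synthesize_with_reference(
    model: Any,
    text: str,
    reference_wavs: Sequence[str],
    language: str = "en",
    gpt_cond_len: int = 30,
    temperature: float = 0.75,
    speed: float = 1.0,
) -> Dict[str, Any]:
    """Condition on the reference clips, then synthesize `text` (hypothetical helper)."""
    # Assumption: `model` is a loaded TTS.tts.models.xtts.Xtts instance.
    # gpt_cond_len is the number of seconds of reference audio used for
    # conditioning; a longer window may capture more of the speaking style.
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=list(reference_wavs),
        gpt_cond_len=gpt_cond_len,
    )
    # Assumption: temperature and speed are accepted by Xtts.inference.
    # Whatever the model returns is passed through unchanged.
    return model.inference(
        text,
        language,
        gpt_cond_latent,
        speaker_embedding,
        temperature=temperature,
        speed=speed,
    )
```

Calling it would look like `out = synthesize_with_reference(model, t, ["./speaker1.wav", "./speaker2.wav"], speed=0.9)`. Taking the model as an argument keeps the helper independent of how the checkpoint is loaded.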