JarodMica / ai-voice-cloning

GNU General Public License v3.0

My trained german model sounds very american :D #145

Open Sp4wnf3rk3l opened 1 month ago

Sp4wnf3rk3l commented 1 month ago

Hey Jarod o/ First of all: Thx for your hard work!

I'm running into trouble while generating a German voice and I hope you can help me.

So I'm trying to prepare a german dataset and run training afterwards, with the help of this video -> https://www.youtube.com/watch?v=WWhNqJEmF9M

I prepared the dataset, created a BPE tokenizer and ran a successful training with 500 epochs, but when I try to use the trained model it doesn't sound German at all. It has a harsh American accent.

Am I doing something wrong? Do I have to alter the tokenizer in some way?

Thx for your help!

*edit In fact I noticed that the training curves don't look like there is constant progress. They go down in the beginning but stagnate relatively quickly. It's not what you described in another video, where they go all the way to the bottom and stay there; instead they go down a bit and then the curve gets very flat. I maxed out the learning settings, since I'm creating a German model... but it doesn't seem like the model is getting the German pronunciation right. I will update the dataset with more data now, but I'm not certain that this is the issue.

TalapMukhamejan commented 1 month ago

Hello. I have been training these models a lot, so maybe I can help you. Can you tell me how much data you used for the German model? In my case it had an English accent when the tokenizer was still the default one. So maybe you prepared your tokenizer but forgot to switch to it. You should reload the web UI after that. Can you please provide a screenshot of the tab with those settings? If you have any other questions, don't hesitate to ask.
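One way to rule out the default-tokenizer problem described above is to look inside the tokenizer file itself. The sketch below is only an illustration and not part of ai-voice-cloning: it assumes the BPE tokenizer is stored as a Hugging Face `tokenizers`-style JSON with the vocabulary under `model.vocab`, and the function name `missing_chars` is made up for this example. It reports which German-specific characters never occur in any token.

```python
import json


def missing_chars(tokenizer_json_text, chars="äöüß"):
    """Return the characters from `chars` that appear in no vocab token.

    Assumes a Hugging Face `tokenizers`-style layout: {"model": {"vocab": {...}}}.
    """
    data = json.loads(tokenizer_json_text)
    vocab = data.get("model", {}).get("vocab", {})
    # Joining a dict joins its keys, i.e. all token strings.
    all_tokens = "".join(vocab)
    return [c for c in chars if c not in all_tokens]


# Usage (the path is hypothetical; point it at the tokenizer you selected):
# with open("models/tokenizers/my_german_tokenizer.json", encoding="utf-8") as f:
#     print(missing_chars(f.read()))
```

If the list that comes back contains the umlauts, the file is probably still an English-only vocabulary, which would fit the accent problem described above.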

Sp4wnf3rk3l commented 1 month ago

Hey Talap :) thank you so much for your effort...

I did in fact prepare the tokenizer and loaded it in the settings tab (and reloaded the UI).

I use over 4 hours of data by now, but maybe I'll double this for the next try. Is there an easy way to add new files to a dataset btw? Or do I have to compile it all over again?

Screen Sets

Screenshot 2024-07-27 150222

What I often encounter when I try to start the training are errors like this:

Screenshot 2024-07-27 143357

If I restart the app and try again, sometimes it works, sometimes it doesn't.

TalapMukhamejan commented 1 month ago

The encoding problem might appear because of German letters that don't exist in English, like Ö. I faced that type of problem, not with Tortoise but with another model, and fixed it by changing the encoding type. It's probably caused by the encoding in which your text file is saved. You can search for that problem: https://stackoverflow.com/questions/11277182/utf8-codec-cant-decode-byte-0x94

Actually, I have never trained a new language on such a small dataset; a dataset that size is better suited to fine-tuning a model that already speaks your language. My datasets ranged from 37 hours at the start up to 1.7k hours. So you should train the model on a larger dataset first and then fine-tune on the 4-hour dataset if it contains your voice.

About the reason why it might sound American: since the initial model was trained on a 50k-hour dataset, trying to fine-tune it on such a small one may result in this kind of thing. If you don't want to do that, you can train an RVC model and apply it, and that should remove the American accent. At least I think so.
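For the encoding error, a minimal sketch of re-saving a transcript file as UTF-8 might look like this. It assumes the file was written by a Windows editor in cp1252 (or Latin-1), which is a guess and not something confirmed in this thread; the function name `reencode_to_utf8` and the example path are made up.

```python
from pathlib import Path


def reencode_to_utf8(path, fallback_encodings=("cp1252", "latin-1")):
    """Rewrite `path` as UTF-8 and return the encoding it was read with.

    If the file is already valid UTF-8 it is left untouched.
    """
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")
        return "utf-8"  # nothing to do
    except UnicodeDecodeError:
        pass

    for enc in fallback_encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this guess does not fit, try the next one
        Path(path).write_bytes(text.encode("utf-8"))
        return enc

    raise ValueError(f"could not decode {path} with {fallback_encodings}")


# Usage (hypothetical path to the dataset transcript):
# print(reencode_to_utf8("training/my_german_voice/train.txt"))
```

Keep a backup of the original file first, because a wrong guess about the source encoding would silently turn the umlauts into other characters.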

Sp4wnf3rk3l commented 1 month ago

Thx for the insights... I really appreciate it.

As I am relatively new to this, I'll try to explain myself a little bit^^

I thought about training a German model from scratch because I couldn't find a good working one, besides the one that comes with Tortoise anyway. So I gathered a dataset from only one German voice.

What I get from reading your post is that such a base model should be trained with a reaaaally big dataset from different voices just to get the pronunciation etc. right and apply an RVC afterwards? So in the long run I could also train a model with 100+ hours of different voices and use that one as a base model to train a single-voice model?

I was thinking that I had to compile a set out of a single voice to copy it in some way. This would be the way to go if my speaker was english and I could train him on a model that speaks english anyway? or if I would provide a dataset of him that has more hours than the autoregressive model to change the language?

Because I have literally no experience in programming ... besides knowing how to execute a python script ... even reencoding a file because of the special letters seems to be a problem here :D

So again: you think applying an RVC could do the trick? Because I feel like the models I am testing sound very robotic and often don't get the intonation right, even in English. I tested applying RVCs in the beginning, but they inherit these issues from the original model. That's why I wanted to train a model in the first place.

If I take a look at other issues here, this one for example:

https://github.com/JarodMica/ai-voice-cloning/issues/133

It seems like ppl are training with 10h of another language and getting half-decent results... and in Jarod's videos it looks like he uses about 10k minutes(?) (I'm not completely certain right now) and his model speaks Vietnamese perfectly :D I want it that way^^

Thx again! Have a great day!

Mrfrize1 commented 1 month ago

Hi. I'm also new to this and my comment is not exactly on topic, but I read that the author of this issue used his own tokenizer (if I understood correctly), and I would like a short description of how to make one. I'm currently training a model in Ukrainian, which has completely different letters and even pronunciation. So: is it enough to just click “create tokenizer” in Tortoise itself after entering the ISO 639 language code, or do I need to somehow rewrite it in the file or something like that? And if you have to do it yourself, please write how. I saw something similar in one of Jarod's videos, but I didn't quite understand it.

Sp4wnf3rk3l commented 1 month ago

The tokenizer is made by clicking on create BPE-Tokenizer after creating the dataset ;) It's just one click. But since you are Ukrainian and use the Cyrillic alphabet, this could be quite a challenge. If you are able to write it in Latin script, however, it could be half as complicated.

This video will tell you what you need to know for just creating the tokenizer file:

https://www.youtube.com/watch?v=WWhNqJEmF9M

This one shows how to create a tokenizer for non-Latin languages, but that's too much for me to comprehend, too :D

https://www.youtube.com/watch?v=crQBvdurQCY

theslipperyCarrot commented 1 month ago

Hi! I'm new to the game, but very interested. I don't have much idea about it, but I feel like slowly working my way into it.

If I have understood correctly, first of all a good german Tortoise voice is needed as a basis and then your own voice can be trained as an RVC. Later you use the Tortoise voice in combination with the specially trained / cloned voice!?

There is a good German TTS voice by Thorsten Müller. It was trained with Coqui TTS and Piper TTS. The models are available for download in .onnx and .json (Piper) formats. The voice can be found at: https://www.thorsten-voice.de/kostenloses-deutsches-text-to-speech-tts/ The download is available at: https://huggingface.co/rhasspy/piper-voices/tree/v1.0.0/de/de_DE/thorsten/high It's only about 114MB in size. This is a voice made for small devices like the Raspberry Pi. Maybe the voices trained with Coqui (VITS and DDC?) are better, but I could not find a download for them. Is it possible to import or use this voice, or otherwise convert it or something?

And then I wonder if you need a female tortoise voice as a basis for the female RVC voices? Best regards! Jonas

Sp4wnf3rk3l commented 1 month ago

Thx for sharing :) But with that crappy Thorsten model you won't get far :D I tested it... it even has a hard time pronouncing ch at the start of words. Better to build your own with XTTSv2, for example. Much easier, faster and better results!

When Jarod has his new StyleTTS WebUI up and running... I will switch to that... It looks very promising; while it still lacks easy multi-language support at the moment, it is doable and seems to sound soooooo nice.

*Nearly forgot: yes :D Normally, building upon a German model would be the way to go, but there is none. And nonetheless, with a big enough dataset, 10000k hours or so... you can train even the English Tortoise model to speak German.

If you want to train just an RVC, you won't need much data nor a German model... you can train it easily with the RVC-WebUI, for example.

My recently trained german model with XTTSv2 and RVC ->

https://vocaroo.com/17ojhxS8kRdD

not perfect... but not half bad i would say :D

theslipperyCarrot commented 1 month ago

This sounds pretty good for me! If i could get a voice in that quality, i would be happy for now.

XTTSv2 seems to have no Web-UI, right? I have no idea about coding, sadly. Is it easy to use for newbies? "by using just a quick 6-second audio clip" sounds a little too good to me!? It's this one, right?: https://huggingface.co/coqui/XTTS-v2

I had a look at RVC-Web-UI (https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) a while ago. But I thought that without a good German Tortoise voice an RVC voice would be useless. And there is a lot of Chinese on the git. And the training can also be done in the AI-Voice-Cloning v3 Web-UI, right?

After playing around with Stable Diffusion for Pictures i thought cloning a voice would be easy. Oh boy, it's not!

theslipperyCarrot commented 1 month ago

I have searched a little bit and will give these three WebUIs a try when I have some time for it.

XTTS WebUI: https://github.com/aitrepreneur/xtts-webui?tab=readme-ov-file

This works also with XTTSv2.

XTTS Finetune WebUI: https://github.com/aitrepreneur/xtts-finetune-webui

XTTS RVC WebUI: https://github.com/aitrepreneur/XTTS-RVC-UI

I wonder if this works also for german Voices.

Sp4wnf3rk3l commented 1 month ago

Sry for the late answer... first work day... not as motivated as I'd like to be :D

I'm using his builds -> https://github.com/daswer123 but you'll need his finetuning release also, if you want to train your xtts model.

You can use only a 6 second clip and it kinda works, but using more data will help with the quality and precision. The example from my last post was made with approx. 10 mins to clone the voice I wanted, but I also applied an RVC of the same voice, which was trained for 100 epochs with about 6h of data.

Voice cloning will be easy... in 1 or 2 years, once the hype is over, but at the moment companies want to make money out of it. Look at the unbelievable quality of ElevenLabs, for example. The thing with voice cloning is that open-source AIs didn't go through the roof here the way art AIs like Stable Diffusion did. Art AIs improved so damn fast in the last few years, but for voice cloning that is still to come.

Have fun testing o/

*If you have some time on your hands you could also try using ChatGPT and coding your own WebUI... in fact, it is much easier than you would think. GPT does all the work and it does it well. With literally NO experience in Python I managed to put up my own little WebUI in literally half an hour. So far I have only managed to get StyleTTS generating output, plus the possibility to apply RVC. But dude... HALF A FREAKING HOUR! :D

theslipperyCarrot commented 1 month ago

O.K., creating your own Web-UI in half an hour is impressive! Even if it still has limited functions.

I invested about an hour and a half yesterday to install the three Web-UIs and it didn't work for any of them! ;( I think it was due to ffmpeg or cuda. I have to look at it again and try again. But there were also many conflicts over dependencies. Let me see!

I think Jarod's Web-UI is definitely the best so far, from what I've looked at. You only need a TTS voice if you want to create a German voice. That is sad.

theslipperyCarrot commented 1 month ago

Yesterday I experimented a bit with different WebUIs to create a finetune model and an RVC voice. I was successful with the finetune model, but not with the RVC voice so far.

Since I didn't know where to store the finetune model and the config.json and what settings to make in the ai-voice-cloning-WebUi, I just played around a bit here as well.

Now my WebUi won't start anymore! :( I think there is a problem with the config.json file, because the error says: "Exception: expected , or } at line 2 column 18"

But when i delete the config.json file from C:\KI\ai-voice-cloning-3.0\models\tokenizers, i get the Error: "FileNotFoundError: [Errno 2] No such file or directory: './models/tokenizers/config.json'"

Can I delete/reload/start any data to reset the WebUi to default? Or what can i do else? Thanks!
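To see exactly where a JSON file like this is broken before deleting anything, a small check along these lines may help. It is a sketch that assumes config.json is meant to be plain JSON; `find_json_error` is a made-up helper name, and Python's own parser words its messages differently from the WebUI error quoted above.

```python
import json
from pathlib import Path


def find_json_error(path):
    """Return a 'line X, column Y: message' string, or None if the file parses."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Usage with the path from the error message above:
# print(find_json_error("./models/tokenizers/config.json"))
```

A result of None means the file is at least syntactically valid, so the problem would have to be in its contents rather than its syntax.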

Sp4wnf3rk3l commented 1 month ago

Just delete the config ;)


theslipperyCarrot commented 4 weeks ago

I had found another solution. But just deleting the config.json was not enough.

Every time I wanted to enter voices, it shot the WebUI to pieces. I think the problem was that I had created the finetune model with XTTS. That doesn't seem to be compatible.

My solution was: delete config.json, copy working *.json and rename it to config.json, launch WebUI and change settings, and then delete the new config.json.
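The first two steps of that workaround (replace the broken config.json with a copy of a tokenizer file that works) could be scripted roughly as below. This is only a sketch: `restore_config` and the file name in the usage comment are hypothetical, and it keeps the broken file as a backup instead of deleting it outright. Launching the WebUI, changing the settings and removing the temporary config.json afterwards remain manual steps.

```python
import shutil
from pathlib import Path


def restore_config(tokenizer_dir, working_name):
    """Back up a broken config.json and replace it with a copy of a working file."""
    tokenizer_dir = Path(tokenizer_dir)
    config = tokenizer_dir / "config.json"
    if config.exists():
        # Keep the broken file for inspection instead of deleting it.
        config.replace(tokenizer_dir / "config.json.broken")
    shutil.copyfile(tokenizer_dir / working_name, config)
    return config


# Usage (the second argument is whichever *.json in that folder still works):
# restore_config("./models/tokenizers", "some_working_tokenizer.json")
```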

After a lot of experimentation, I slowly have the impression that Tortoise is not the right choice for a German voice. XTTS just seems to be doing better. The model sounds better even before fine-tuning.

blastbeng commented 3 weeks ago

Has anyone ever solved this? I have read the comments, but I don't understand how to install another UI to change the generator language.

I have trained on thousands of hours of Italian, but this tool still speaks my sentences with an American accent, even with the correct tokenizer.

theslipperyCarrot commented 1 week ago

I invested a lot of time to create a German Tortoise-TTS voice. The results were sobering until the end! However, I got very good results with XTTS and RVC. Sometimes individual parts of words or words are swallowed. Maybe I'll find a solution for that. But most of the sentences come out perfectly! It's a real shame because the WebUI of @Jarod is the best! And the Audiobook-maker is awesome! I haven't found anything like this for XTTS yet.

blastbeng commented 1 week ago

I am trying to do the same with Bark and RVC