erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd party software via JSON calls.
GNU Affero General Public License v3.0

Not quite sure how to describe. Gibberish voice while finetuning a second time for the next voice. #197

Closed RenNagasaki closed 4 months ago

RenNagasaki commented 4 months ago

🔴 Please generate a diagnostics report and upload the "diagnostics.log" as this helps me understand your configuration. diagnostics.log

Describe the bug Not sure if it's a bug, but after the latest update I can't get a second round of finetuning to work. Before, I could finetune with more than one voice so the model could do more than one. But now, whenever I try that, I get results like this: gibberish.zip

To Reproduce Finetune with one voice. Save the model, then try to finetune that existing model with a second voice.

Text/logs The logs don't show any errors.

Desktop (please complete the following information): AllTalk was updated: a few days ago. Custom Python environment: no

RenNagasaki commented 4 months ago

As I said in the main post, I'm not sure if it's related to AllTalk, but that's the only thing I remember changing.

I just hope you have some ideas/experience with this.

erew123 commented 4 months ago

Hi @RenNagasaki

I'm not sure how far I will get to look into this at this late hour, but, just so I am 100% on board with what you are telling me...

  • You have voice samples for person A and you trained the model on person A.
  • You've then saved the model, and got voice samples for person B and trained the already trained model on person B.

Correct so far?

And when you now generate TTS, you get a strange garbled output? Is that for any reference sample file you use, or just reference sample files that were from Person A or B, or both?

Thanks

RenNagasaki commented 4 months ago

> Hi @RenNagasaki
>
> I'm not sure how far I will get to look into this at this late hour, but, just so I am 100% on board with what you are telling me...
>
> • You have voice samples for person A and you trained the model on person A.
> • You've then saved the model, and got voice samples for person B and trained the already trained model on person B.
>
> Correct so far?

Yes

> And when you now generate TTS, you get a strange garbled output? Is that for any reference sample file you use, or just reference sample files that were from Person A or B, or both?
>
> Thanks

Good question. I only ever tried the new voice. Will give that a try next.

Only had a short while to test, but it seems to happen each time a second voice gets introduced, no matter which voice I try first.

This for sure didn't happen before.

RenNagasaki commented 4 months ago

Tried to train the second voice (B in your case), which failed before, from the base xtts2 model. Worked flawlessly. Now trying to train A on top.

erew123 commented 4 months ago

I've just run a very quick test and had similar results (I only did 2x epochs per voice). It might be something to do with the weights getting mangled on a second training run. Let me scratch my head a bit on this one and have a think.

RenNagasaki commented 4 months ago

What's confusing me is that it worked before your last patch. But that didn't even touch that stuff, or did it?

erew123 commented 4 months ago

Well, I couldn't sleep, so I managed to run a good few training sessions and tests. It was related to the extraction of the dvae weights on the final copy of the model; it must have been corrupting the file somehow. So first off, I've corrected that and updated finetune.py. (I've also added a field for naming the speaker you are training, so that it puts the name into your CSV files.)

As for your round 1 finetuned models, copy the dvae.pth from the base model folder over the top of the dvae.pth in the round 1 finetuned model folder. The previously finetuned model should then train correctly on the 2nd round of training.
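If you want to script that copy step, a minimal sketch is below; the two folder paths are placeholders, so point them at wherever your own base model and round 1 finetuned model actually live:

```python
# Minimal sketch of the dvae.pth workaround described above.
# Both folder paths are placeholders, not AllTalk's fixed layout.
import shutil
from pathlib import Path

base_model_dir = Path("models/xttsv2_base")   # placeholder: base XTTS model folder
finetuned_dir = Path("models/trainedmodel")   # placeholder: round 1 finetuned folder

# Overwrite the (corrupted) finetuned dvae.pth with the known-good base copy.
shutil.copyfile(base_model_dir / "dvae.pth", finetuned_dir / "dvae.pth")
print(f"Replaced {finetuned_dir / 'dvae.pth'} with the base model copy.")
```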


Apologies for the cockup!

RenNagasaki commented 4 months ago

Oh, wow. Thanks for being so fast! If I could, I would give you a kiss. 😍

RenNagasaki commented 4 months ago

Regarding the speaker name: does the voice file used to inference this speaker have to have the same name? Or is the speaker name only needed to tell the training it's a different voice being trained?

erew123 commented 4 months ago

With the XTTS model it won't make any difference, with a caveat I will explain in a minute. I've put this box there more for other types of models that you will be able to train in future (VITS etc.).

When you train the XTTS model, we are training it with d_vectors (as they call them). XTTS is a multi-speaker model (meaning it can generate or store more than one voice). As such, you use the multi-speaker training method and either d_vectors OR a speaker embedding layer: https://docs.coqui.ai/en/latest/training_a_model.html#multi-speaker-training

Both of the above methods do pretty much the same thing to the model, but a speaker embedding layer (not to get into too fine a level of detail) embeds a wav file into the model (well, into the speaker.pth, as I understand it). Because speaker embedding needs to reference a name for the embedded wav, you would train with a speaker name, which you can later call on when generating TTS: https://github.com/coqui-ai/TTS/discussions/1171#discussioncomment-2088588

So the caveat is that if you use speaker embedding layers, then the speaker name matters; otherwise it doesn't. Either way, the end result is the same.

Neither the training nor the TTS generation in AllTalk uses speaker embedding layers on the XTTS models, for a multitude of reasons. One being, there is no way to simply extract a list of names from the speaker embedding file programmatically. So if we started pushing finetuned names into the speaker embedding files, I couldn't pull the names out of that file and present them to people easily through the interface, and there would be no easy way to track them. Other model types, however, will need this feature for training.
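To make the d_vector route concrete, here is a rough sketch of loading an XTTS model with the Coqui TTS API and conditioning it on a reference wav. The paths are placeholders and this is a simplification, not AllTalk's actual code; the point is that the speaker is identified purely by the reference audio, never by a stored name:

```python
# Rough sketch of d_vector-style voice conditioning with the Coqui XTTS API.
# All paths below are placeholders, not AllTalk's real layout.
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/path/to/model/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/model/", eval=True)

# The reference wav alone identifies the speaker; no name is embedded in
# the model, which is why the speaker-name box has no effect on XTTS output.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["/path/to/reference.wav"]
)

out = model.inference(
    "Hello, this is a finetuned voice.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```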

It's a long and deep technical explanation/discussion from this point on, but hopefully that gives you enough of an answer.

Thanks

RenNagasaki commented 4 months ago

Ahhh, I understand. Thank you for clarifying. Everything works like a charm now! :D