erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with 3rd-party software via JSON calls.
GNU Affero General Public License v3.0

Alltalkbeta #288

Open IIEleven11 opened 1 month ago

IIEleven11 commented 1 month ago

You'll see two scripts: compare_and_merge.py and expand_xtts.py.

I didn't do any integration with AllTalk, so these scripts can run as-is, standalone.

Steps to use:

  1. Run start_finetune and check the "bpe_tokenizer" box to train a new tokenizer during transcription
  2. Begin transcription
  3. When transcription is complete, you will have a bpe_tokenizer-vocab.json
  4. Open compare_and_merge.py and fill in the file paths for the base model files and the new vocab.
  5. Run compare_and_merge.py
  6. You now have an expanded_vocab.json.
  7. Open expand_xtts.py and fill in the file paths
  8. Run expand_xtts.py

You now have an expanded base XTTSv2 model, "expanded_model.pth", and its matching "expanded_vocab.json". Remove the base XTTSv2 model from "/alltalk_tts/models/xtts/xttsv2_2.0.3/model.pth" and the base "vocab.json" from "/alltalk_tts/models/xtts/xttsv2_2.0.3/vocab.json". Place "expanded_model.pth" and "expanded_vocab.json" in "/alltalk_tts/models/xtts/xttsv2_2.0.3/" in place of the removed model/vocab, and rename them to "model.pth" and "vocab.json".
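
For anyone curious what the expansion step amounts to, here is a rough sketch of the kind of checkpoint surgery involved (not the actual expand_xtts.py code): grow the text embedding and the matching output head to the merged vocab size while keeping the learned rows intact. The key names, paths, and vocab size below are assumptions for illustration only.

```python
import torch

OLD_CKPT = "model.pth"            # base xttsv2 checkpoint (path illustrative)
NEW_CKPT = "expanded_model.pth"
NEW_VOCAB_SIZE = 7544             # size of expanded_vocab.json (example value)

ckpt = torch.load(OLD_CKPT, map_location="cpu")
state = ckpt.get("model", ckpt)   # some checkpoints nest weights under "model"

def grow_rows(weight: torch.Tensor, new_rows: int) -> torch.Tensor:
    """Keep the learned rows and randomly initialise the rows for new tokens."""
    old_rows, dim = weight.shape
    grown = torch.empty(new_rows, dim).normal_(mean=0.0, std=0.02).to(weight.dtype)
    grown[:old_rows] = weight
    return grown

# Assumed key names; the real checkpoint layout may differ.
for key in ("gpt.text_embedding.weight", "gpt.text_head.weight"):
    if key in state:
        state[key] = grow_rows(state[key], NEW_VOCAB_SIZE)

bias_key = "gpt.text_head.bias"
if bias_key in state:
    old_bias = state[bias_key]
    grown_bias = torch.zeros(NEW_VOCAB_SIZE, dtype=old_bias.dtype)
    grown_bias[: old_bias.shape[0]] = old_bias
    state[bias_key] = grown_bias

torch.save(ckpt, NEW_CKPT)
```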

That's it; you can now begin fine-tuning.

You'll find each file commented with more detail about what's going on. finetune.py had an edit I was using to rotate the port: when using an online instance, the port can linger blocked after I end the script, which causes the script to fail until I go in and change the port. Setting a range from port # to port # fixes that issue. I removed it as it's beyond the scope of this specific PR, but I can send it in another if that's something you want to implement.
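
For reference, the idea behind that removed edit looks roughly like this (the port range and function name are illustrative, not the code that was in finetune.py):

```python
import socket

def find_free_port(start: int = 7861, end: int = 7870) -> int:
    """Return the first port in [start, end] that can be bound on this host."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("0.0.0.0", port))
            except OSError:
                continue          # port still held by a lingering session, try the next one
            return port
    raise RuntimeError(f"No free port in range {start}-{end}")

# e.g. pass the result as the server port when launching the web interface:
# demo.launch(server_port=find_free_port())
```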

IIEleven11 commented 1 month ago

Ignore my finetune.py script changes. I reverted them.

So this solution worked with no slurred speech and no accent with the 2.0.2 model. I believe the accent with the 2.0.3 model was inherent to that base model and not specific to this solution.

You'll see a new custom_tokenizer.py. This script needs a .txt file produced by running your dataset through the extract_dataset_for_tokenizer.py script, which removes the first and third columns from the CSVs. The output will be your new custom dataset's vocab.json. Use this with the compare_and_merge script, then the expand_xtts script, and begin training.
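
As a rough illustration of that extraction step (not the actual extract_dataset_for_tokenizer.py code), dropping the first and third columns of a pipe-delimited metadata CSV and keeping just the transcription text could look like this; the audio|text|normalized-text column layout is assumed:

```python
import csv

def extract_text_column(csv_path: str, out_path: str) -> None:
    """Write only the middle (transcription) column to a plain text file."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src, delimiter="|"):
            if len(row) >= 2:
                dst.write(row[1].strip() + "\n")   # drop columns 1 and 3

# Example usage (file names are hypothetical):
extract_text_column("metadata_train.csv", "tokenizer_corpus.txt")
```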

As for the 2.0.3 model, it remains unknown, and I fear it always will, as Coqui has exited the party. So it might be wise to revert the default model from 2.0.3 to 2.0.2.

I had to do a lot of learning here, so I am cautious and open to the possibility I missed something, especially with the creation of the new tokenizer. If anyone has anything to point out, please do.

erew123 commented 1 month ago

Hi @IIEleven11

Sorry it's taken a while to respond; some days I'm busy elsewhere, and some days I wake up to 10+ messages to deal with before I even get to look at anything.

If I'm interpreting what you've said correctly, it works fine on the 2.0.2 model, but 2.0.3 goes a bit funny. The only difference I know of with the 2.0.3 model is that they introduced two new languages, which I think were Hungarian and Korean: https://docs.coqui.ai/en/latest/models/xtts.html#updates-with-v2

But actually, they added three languages. Hindi was added too, but not documented anywhere (that I ever found) apart from here: https://huggingface.co/coqui/XTTS-v2#languages

As there is no difference in the training setup between the models (that I know of), do you think there could be something different in the config.json or vocab.json that makes 2.0.3 funny to train?

Apologies for the questions; I'm just digging into the knowledge you've gained and wondering if I can think of anything that may help solve the puzzle.

That aside, thanks for all your work on this! I will test it soon. :)

IIEleven11 commented 1 month ago

Yeah, check https://github.com/coqui-ai/TTS/issues/3309#issuecomment-1828324856. They do acknowledge there was some fallback, specifically when adding new languages/speakers.

I am curious what would happen if we removed the non-English tokens from the vocab.json; they take up a very large amount of space. I would think it would allow for more English vocabulary and therefore a better English-speaking model. It will invite many requests for multilingual support, though.

The configs and vocabs for each version of the model are different: the 2.0.2 vocab is smaller and has a smaller embedding layer. So they aren't compatible for inference or training without adjusting the architecture of the model.
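
A quick way to see the mismatch being described is to print the text-embedding shape of each checkpoint; the key name and paths below are assumptions and may need adjusting:

```python
import torch

for path in ("xttsv2_2.0.2/model.pth", "xttsv2_2.0.3/model.pth"):
    ckpt = torch.load(path, map_location="cpu")
    state = ckpt.get("model", ckpt)           # weights may be nested under "model"
    for key, tensor in state.items():
        if "text_embedding" in key:           # print any embedding-looking tensors
            print(path, key, tuple(tensor.shape))
```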

There are a couple of other fine-tuning webuis that also default to 2.0.2; Daswer's fine-tuning webui, for example.

But yeah, more testing, of course. I only used it with a single dataset. I think letting the community go at it would be a good approach for now, as we can only really confirm with more testing. We are somewhat working blind with whatever information Coqui left behind.

erew123 commented 1 month ago

I can tell you why we both used the 2.0.2 model at the time of creating the interfaces. The 2.0.3 model had something bad/wrong in the model's configuration (or something) that created very, very bad audio. The solution back then was to use 2.0.2. Coqui did resolve 2.0.3 eventually; however, it was just easier to stick with 2.0.2 at the time rather than re-code.

IIEleven11 commented 1 month ago

> I can tell you why we both used the 2.0.2 model at the time of creating the interfaces. The 2.0.3 model had something bad/wrong in the model's configuration (or something) that created very, very bad audio. The solution back then was to use 2.0.2. Coqui did resolve 2.0.3 eventually; however, it was just easier to stick with 2.0.2 at the time rather than re-code.

Ahh, I did see your comment back then, yeah. The accent within the voice could very well have been an error somewhere on my part; I don't want to remove that possibility from the equation.

The 2.0.3 model has pros and cons. I think it can meet a wider range of people's needs than 2.0.2 because it has a slightly bigger vocab, but this also means its potential is possibly lesser than 2.0.2's.

The big reason I'm hesitant to provide what I did to remove all but the English tokens in the vocab.json is that I am not confident I completely understood all the changes I made. While it most certainly worked, for some of it I just said "that looks right" and moved on. Training models is really complex, and I just want to make sure I'm not providing code that will give someone a harder time due to my ignorance.

erew123 commented 1 month ago

Hi @IIEleven11, hope you are keeping well. Apologies for not catching up with you; it's been a busy week for me with quite a few requests/issues across lots of things.

Thanks for the updates above. Do you think it's now time for me to merge/test this out?

Thanks

IIEleven11 commented 1 month ago

> Hi @IIEleven11, hope you are keeping well. Apologies for not catching up with you; it's been a busy week for me with quite a few requests/issues across lots of things.
>
> Thanks for the updates above. Do you think it's now time for me to merge/test this out?
>
> Thanks

Yeah, I would really love it if another developer would look into it with me. I've been trying to essentially reverse-engineer Coqui's code and would love another mind to collaborate with.

I have tested it a few more times since then. Adding vocabulary works as expected.

One thing, though: I am trying to add a new special token, which is proving to be a bit more nuanced.

I would guess most users don't try to do this, though, so it shouldn't be a problem for now.
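
For context, registering a new special token in the tokenizers-format vocab.json is the easy part; the nuance is that the model's embedding rows also have to grow to match (as in the expansion sketch earlier). A minimal sketch, with illustrative file names:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("expanded_vocab.json")   # XTTS vocab.json is a tokenizers file
added = tok.add_special_tokens(["[whisper]"])      # returns how many tokens were added
tok.save("expanded_vocab.json")
print("added:", added, "new vocab size:", tok.get_vocab_size())
```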

IIEleven11 commented 1 month ago

I also saw you were deep into the conversation at one point in some really old commits. Do you know anything about the loss of the ability to prompt-engineer the model between Tortoise and XTTS?

Things like "[joy] it's nice to meet you!" would generate an emotional, joyous sentence. Tortoise can do it. The XTTSv2 paid API could do it. But now we can't.

This is what I've been trying to solve. It would appear they removed this functionality from the open-source versions, and because the Tortoise and XTTS models are nearly identical, I believe we could put the pieces together to get it back.

erew123 commented 1 month ago

Hi @IIEleven11. I spent my morning cleaning up after spilling coffee all over my desk, computer, keyboard, wall, floor, etc.... :/ so I lost a few hours of my day where I was hoping to respond properly, look into a few things, etc. How annoying!

Anyway, first off, I found this conversation earlier: https://github.com/coqui-ai/TTS/issues/3704. I wonder if that may be of interest?

As for emotions, I didn't know they HAD implemented them at some point in the past, but it must have been planned according to this discussion https://github.com/coqui-ai/TTS/discussions/3255, and I can see it on the roadmap https://github.com/coqui-ai/TTS/issues/378 as "Implement emotion and style adaptation" in the as-yet-uncompleted "milestones along the way".

To add to all this, eginhard (https://github.com/eginhard) is currently maintaining TTS and the Coqui scripts. He is not someone who worked for Coqui (as I understand it); he is just passionate about TTS and the Coqui model. He also appears to be doing quite a bit of work on the trainers/finetuning (https://github.com/idiap/coqui-ai-TTS/commits/dev/, yet to be released). I'm not sure how involved he may want to be with another project, but I suspect he knows quite a bit about the trainer and has probably figured out quite a bit about the model. Maybe he would be a good person for us to ask a few questions (should he have time). I suppose we could pose any questions there, if you agree that could be a good path?

IIEleven11 commented 1 month ago

Awesome! Thanks for the leads. Yeah that's a good idea.

I did just make a breakthrough, though, that kind of confirms some of my theories.

I trained an XTTSv2 model that can whisper using a custom special token, "[whisper]". So I think this means we can technically make any special token, including ones for emotions.

The only difference is that Tortoise can just do many emotions, and these tokens are nowhere to be found within its vocab.json, yet it knows exactly how to handle them.

Anyway, my conclusion with this new tokenizer is that if people want to train new vocabulary, they need a significant amount of data. 4 or 5 hours only works partially: the model will lose the ability to generate certain sounds while gaining the ability to say others. This is negated with more data. It looks like somewhere around 15 to 20 hours, give or take, would be more ideal.

erew123 commented 1 month ago

Wow! Training it to emote, that's pretty cool!

Re: your conclusion, though, that sounds similar to what I read about training an entirely new language into the model without fully training all other languages at the same time. I imagine you need a hell of a lot of compute to build out a base model for this.

erew123 commented 2 weeks ago

Hi @IIEleven11, hope you are well. Apologies again; I'm struggling to get near code/deal with support at the moment. I don't want to air my life on the internet, but for the past few months I have had an ongoing situation that has me traveling away from my own home and computer, providing help/care for a family member.

If you feel this should be merged in, I am happy to do so, as long as you feel it's bug-free. I can give it a run-through when possible and check everything works.

If there is anything specific you would like me to look at or help you figure out, please give me a list of items and I will try to do so.

I will get to it as soon as I can.

All the best

IIEleven11 commented 1 week ago

> Hi @IIEleven11, hope you are well. Apologies again; I'm struggling to get near code/deal with support at the moment. I don't want to air my life on the internet, but for the past few months I have had an ongoing situation that has me traveling away from my own home and computer, providing help/care for a family member.
>
> If you feel this should be merged in, I am happy to do so, as long as you feel it's bug-free. I can give it a run-through when possible and check everything works.
>
> If there is anything specific you would like me to look at or help you figure out, please give me a list of items and I will try to do so.
>
> I will get to it as soon as I can.
>
> All the best

Oh, sorry, actually I have an update for it that solves the model losing the ability to speak specific words: we need to freeze the embedding layers of the base model prior to training. After I push that, you could merge it, but it isn't integrated into your webui, so if anyone wants to use the process they would need to run each script on its own. I could maybe work on integrating it with your code; I don't expect it to be too difficult (famous last words). I am just swamped with clients at the moment and am about to release my own personal project. If I can get to it, though, I will.
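
For anyone following along, "freezing the embedding layers" means excluding those parameters from gradient updates before fine-tuning starts. A minimal sketch, assuming the relevant parameter names contain "text_embedding"/"text_head" (they may differ in the actual XTTS module layout):

```python
def freeze_text_embeddings(model) -> None:
    """Exclude the text embedding and output head from gradient updates."""
    for name, param in model.named_parameters():
        if "text_embedding" in name or "text_head" in name:
            param.requires_grad = False

# Call this right after the model is loaded, before the training loop starts.
# A finer-grained variant would instead register a gradient hook that zeroes
# updates only for the original token rows, so newly added rows can still learn.
```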