Tokenizer changes

You'll see two scripts: compare_and_merge.py and expand_xtts.py.

I didn't integrate them with alltalk, so both scripts run standalone, as is.

Steps to use:

1. Run alltalk finetune and check the BPE tokenizer box to train a new tokenizer during transcription.
2. Begin transcription.
3. When transcription is complete, you will have a bpe_tokenizer-vocab.json.
4. Open compare_and_merge.py and fill in the file paths for the base model files and the new vocab.
5. Run compare_and_merge.py.
6. You now have an expanded_vocab.json.
7. Open expand_xtts.py and fill in the file paths.
8. Run expand_xtts.py.
9. You now have an expanded base XTTSv2 model, expanded_model.pth, and its matching expanded_vocab.json.
10. Remove the base XTTSv2 model from /alltalk_tts/models/xtts/xttsv2_2.0.3/model.pth.
11. Remove the base vocab.json from /alltalk_tts/models/xtts/xttsv2_2.0.3/vocab.json.
12. Place expanded_model.pth and expanded_vocab.json in /alltalk_tts/models/xtts/xttsv2_2.0.3/ in place of the removed base files, and rename them to model.pth and vocab.json.
13. That's it. You can now begin fine-tuning as usual.
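
For anyone who wants a feel for the compare-and-merge step before opening the script, here is a minimal sketch of the general idea. It is not the code in compare_and_merge.py. The function names `merge_vocab` and `merge_vocab_files` are made up for illustration, and it assumes both JSON files keep their token table under `["model"]["vocab"]`, which may not match your files. It also leaves BPE merge rules untouched, which a real merge would likely need to handle.

```python
import json
from pathlib import Path


def merge_vocab(base_vocab: dict[str, int], new_vocab: dict[str, int]) -> dict[str, int]:
    """Return a copy of base_vocab with unseen tokens from new_vocab appended.

    Existing tokens keep their IDs so the base model's embedding rows stay
    aligned; new tokens get fresh IDs after the current maximum.
    """
    merged = dict(base_vocab)
    next_id = max(merged.values(), default=-1) + 1
    # Walk the new tokens in their original ID order so the result is deterministic.
    for token, _ in sorted(new_vocab.items(), key=lambda kv: kv[1]):
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged


def merge_vocab_files(base_path: Path, new_path: Path, out_path: Path) -> int:
    """Merge two tokenizer JSON files and return the number of tokens added.

    Assumption: both files store their token table under ["model"]["vocab"].
    """
    base = json.loads(base_path.read_text(encoding="utf-8"))
    new = json.loads(new_path.read_text(encoding="utf-8"))
    before = len(base["model"]["vocab"])
    base["model"]["vocab"] = merge_vocab(base["model"]["vocab"], new["model"]["vocab"])
    out_path.write_text(json.dumps(base, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(base["model"]["vocab"]) - before
```

The number returned by `merge_vocab_files` is the number of extra token IDs in the merged vocab, which is why the model has to be expanded in a separate step before fine-tuning.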

You'll find each file commented with more detail about what's going on. I also switched the script to use a rotating port, because on cloud instances in particular it's very common to exit the script and have the port stay open for a while, causing an open port issue. Rotating the ports avoids having to go in and change the port manually each time. To be honest, I accidentally pushed with that change in there. Feel free to toss it out if it's beyond the scope of this PR or not something you wish to include.
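
To illustrate what port rotation can look like, here is a rough sketch. It is not necessarily how the change in this PR is written. The helper name `find_free_port` and its default start port are hypothetical. It walks a small range and returns the first port that can actually be bound.

```python
import socket


def find_free_port(start: int = 7860, attempts: int = 20, host: str = "127.0.0.1") -> int:
    """Return the first port in [start, start + attempts) that can be bound.

    The default start port is a placeholder, not a value taken from alltalk.
    """
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is busy or still held by a previous run; try the next one.
                continue
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")
```

The probe socket is closed before the port is returned, so another process could in principle grab the port in between. For a single-user fine-tuning session that small race is usually acceptable.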

erew123 / alltalk_tts

Tokenizer changes #287