erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

vocab.json file missing #115

Closed johnbenac closed 8 months ago

johnbenac commented 8 months ago

Please generate a diagnostics report and upload the "diagnostics.log".

diagnostics.log

Describe the bug: After running Step 2, the vocab.json file is absent, despite the program not catching or raising any errors and telling the user that they are ready for Step 3. The vocab.json file is not anywhere in the output files, let alone prepopulated for the user in the XTTS vocab path.

To Reproduce
Steps to reproduce the behaviour: on a Windows 10 machine with a GTX 1070, change the compute method from 16 to 32, load an eight-minute audio file, follow the finetuning instructions using the app as standalone, complete Steps 1 and 2, then try to complete Step 3.

Screenshots: (two screenshots attached)

Text/logs

Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 8
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
Start Tensorboard: tensorboard --logdir=F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

Model has 517360175 parameters

 > EPOCH: 0/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-09 20:39:04) 

 --> TIME: 2024-03-09 20:55:50 -- STEP: 0/14 -- GLOBAL_STEP: 0 | > loss_text_ce: 0.020058806985616684 (0.020058806985616684) | > loss_mel_ce: 4.53383207321167 (4.53383207321167) | > loss: 4.553890705108643 (4.553890705108643) | > grad_norm: 0 (0) | > current_lr: 5e-06 | > step_time: 92.1786 (92.17863059043884) | > loader_time: 907.1138 (907.1137893199921)

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 0.47786664962768555 (+0) | > avg_loss_text_ce: 0.022705783136188984 (+0) | > avg_loss_mel_ce: 4.2755126953125 (+0) | > avg_loss: 4.298218488693237 (+0)

BEST MODEL : F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f\best_model_14.pth

 > EPOCH: 1/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-09 21:06:55) 

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 0.38800346851348877 (-0.08986318111419678) | > avg_loss_text_ce: 0.02224032487720251 (-0.0004654582589864731) | > avg_loss_mel_ce: 4.088733673095703 (-0.18677902221679688) | > avg_loss: 4.110974073410034 (-0.18724441528320312)

BEST MODEL : F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f\best_model_28.pth

 > EPOCH: 2/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-09 21:36:16) 

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 0.557515025138855 (+0.1695115566253662) | > avg_loss_text_ce: 0.02173413336277008 (-0.0005061915144324303) | > avg_loss_mel_ce: 3.9883835315704346 (-0.10035014152526855) | > avg_loss: 4.010117769241333 (-0.10085630416870117)

BEST MODEL : F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f\best_model_42.pth

 > EPOCH: 3/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-09 21:59:59) 

 --> TIME: 2024-03-09 22:16:32 -- STEP: 8/14 -- GLOBAL_STEP: 50 | > loss_text_ce: 0.01974811963737011 (0.02200907445512712) | > loss_mel_ce: 4.236275672912598 (3.8399724662303925) | > loss: 4.25602388381958 (3.8619815707206726) | > grad_norm: 0 (0.0) | > current_lr: 5e-06 | > step_time: 4.0883 (6.187514752149582) | > loader_time: 1.5891 (0.37152737379074097)

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 0.4740140438079834 (-0.08350098133087158) | > avg_loss_text_ce: 0.02119159046560526 (-0.0005425428971648216) | > avg_loss_mel_ce: 3.94236421585083 (-0.04601931571960449) | > avg_loss: 3.9635558128356934 (-0.04656195640563965)

BEST MODEL : F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f\best_model_56.pth

 > EPOCH: 4/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-09 22:22:16) 

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 0.4915173053741455 (+0.01750326156616211) | > avg_loss_text_ce: 0.02076542843133211 (-0.0004261620342731476) | > avg_loss_mel_ce: 3.9248008728027344 (-0.017563343048095703) | > avg_loss: 3.9455662965774536 (-0.017989516258239746)

BEST MODEL : F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f\best_model_70.pth

 > EPOCH: 5/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-09 22:46:07) 

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 0.8810181617736816 (+0.38950085639953613) | > avg_loss_text_ce: 0.020430893637239933 (-0.00033453479409217834) | > avg_loss_mel_ce: 3.8836196660995483 (-0.041181206703186035) | > avg_loss: 3.9040504693984985 (-0.04151582717895508)

BEST MODEL : F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f\best_model_84.pth

 > EPOCH: 6/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-09 23:09:01) 

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 1.2564053535461426 (+0.37538719177246094) | > avg_loss_text_ce: 0.02014606725424528 (-0.0002848263829946518) | > avg_loss_mel_ce: 3.8656585216522217 (-0.01796114444732666) | > avg_loss: 3.8858046531677246 (-0.018245816230773926)

BEST MODEL : F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f\best_model_98.pth

 > EPOCH: 7/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-09 23:34:29) 

 --> TIME: 2024-03-09 23:50:29 -- STEP: 2/14 -- GLOBAL_STEP: 100 | > loss_text_ce: 0.018822336569428444 (0.01947619952261448) | > loss_mel_ce: 3.2692313194274902 (3.266197085380554) | > loss: 3.2880537509918213 (3.2856733798980713) | > grad_norm: 0 (0.0) | > current_lr: 5e-06 | > step_time: 6.2351 (4.781083464622498) | > loader_time: 0.023 (0.023505568504333496)

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 0.9625129699707031 (-0.29389238357543945) | > avg_loss_text_ce: 0.019918162375688553 (-0.00022790487855672836) | > avg_loss_mel_ce: 3.8594086170196533 (-0.006249904632568359) | > avg_loss: 3.879326820373535 (-0.006477832794189453)

BEST MODEL : F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f\best_model_112.pth

 > EPOCH: 8/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-10 00:00:20) 

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 1.17002272605896 (+0.20750975608825684) | > avg_loss_text_ce: 0.019755501300096512 (-0.00016266107559204102) | > avg_loss_mel_ce: 3.870388150215149 (+0.010979533195495605) | > avg_loss: 3.890143632888794 (+0.010816812515258789)

 > EPOCH: 9/10 --> F:\tts\alltalk_tts\finetune\tmp-trn\training\XTTS_FT-March-09-2024_08+38PM-9bb0d1f

 > TRAINING (2024-03-10 00:23:48) 

 > EVALUATION 

--> EVAL PERFORMANCE | > avg_loader_time: 0.5635088682174683 (-0.6065138578414917) | > avg_loss_text_ce: 0.0196254700422287 (-0.0001300312578678131) | > avg_loss_mel_ce: 3.8840038776397705 (+0.013615727424621582) | > avg_loss: 3.903629422187805 (+0.01348578929901123)

Desktop (please complete the following information):
AllTalk was updated: Mar 2024
Custom Python environment: no
Text-generation-webUI was updated: March 9th 2024

Additional context: I installed the finetuning packages using the instructions on GitHub ("pip install -r requirements_finetune.txt"), not the setup file.

These instructions don't seem to be written for users running in standalone mode. For the environment, I actually just opened a command window, ran start_environment.bat from within it, and then proceeded from there. It worked, given that I changed the compute method from 16 to 32 in finetune.py (I made a pull request for this).
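For illustration only, the float16/float32 choice could be guarded automatically. This is a hedged sketch (not the actual finetune.py code, and `pick_compute_type` is a hypothetical helper) of a heuristic based on CUDA compute capability, since pre-Tensor-Core cards like the GTX 1070 (compute capability 6.1) handle float16 poorly:

```python
def pick_compute_type(capability_major: int, has_cuda: bool = True) -> str:
    """Choose a Whisper compute type for transcription.

    Hypothetical helper: cards with CUDA compute capability below 7.0
    (e.g. a GTX 1070 at 6.1) predate Tensor Cores, so float16 kernels
    are slow or unsupported there; fall back to float32.
    """
    if has_cuda and capability_major >= 7:
        return "float16"
    return "float32"

# A GTX 1070 reports compute capability 6.1, so it gets float32.
print(pick_compute_type(6))   # → float32
print(pick_compute_type(8))   # → float16
```

In PyTorch, the capability could be read with `torch.cuda.get_device_capability()`.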

Also, perhaps because I have an older card running 32 rather than 16, it took a few hours, and the Gradio instance timed out. Refreshing the page didn't populate the files in Step 3, as the text on the page suggested it would. I had to restart the environment in the command window and run finetune.py again. This populated the two XTTS files, but not the vocab.json (as it doesn't exist). I don't think this timeout is a related issue, but I wanted to mention it for completeness.

There would be no way to manually enter the path to a vocab.json file anyway, even if I did find it somewhere.

erew123 commented 8 months ago

Hi @johnbenac

The vocab file should be pulled down every time AllTalk starts up (if it's missing), or on the first start-up, then placed in the \alltalk_tts\models\xttsv2_2.0.2 folder. It's then copied/moved from there as needed. Obviously, if it's not in that folder, it won't get copied.

(screenshot)

Can you check your \alltalk_tts\models\xttsv2_2.0.2 folder and see if the file is in there?

There have been some oddities since Hugging Face had its outage a week or so back. I've had two people whose files wouldn't download, or which downloaded as HTML 404-not-found pages (this is a Hugging Face issue).

Let me know (I'll also test a fresh install here in the meantime).

Thanks

erew123 commented 8 months ago

Hi @johnbenac

I've been through the whole process from start to finish and it all worked for me. I can only conclude that for some reason your system hasn't downloaded the vocab file (and possibly other files).

So this is what you should have in your xttsv2_2.0.2 folder:

(screenshot)

These files are pulled/should be pulled down when AllTalk starts up:

(screenshot)

AllTalk also checks for these files each time it starts, so if any files are missing, they should be pulled down.

After running AllTalk for the first time, I went back and installed the Finetune requirements:

(screenshot)

Worked through to Step 3 on the finetuning (which I believe is where you are saying it got stuck):

(screenshot)

I've then tested Step 4 to copy over the files at the end:

(two screenshots)

Please can you confirm the files within your \alltalk_tts\models\xttsv2_2.0.2 folder and also just run start_alltalk.bat to see if it attempts to download any files?

Thanks

erew123 commented 8 months ago

FYI, I have also confirmed you can manually enter the path if necessary e.g.

This is my path example D:\alltalk_tts\models\xttsv2_2.0.2\vocab.json

I think in your case it would be F:\tts\alltalk_tts\models\xttsv2_2.0.2\vocab.json

(screenshot)

johnbenac commented 8 months ago

Thanks for the response!

I had it running again overnight, and I noticed this in my terminal right at the end:

'cp' is not recognized as an internal or external command,
operable program or batch file.
'cp' is not recognized as an internal or external command,
operable program or batch file.
[FINETUNE] Model training done. Move to Step 3

I think the issue is that this:

```python
os.system(f"cp {config_path} {exp_path}")
os.system(f"cp {vocab_file} {exp_path}")
```

needs to be changed to this:

```python
import shutil

shutil.copy(config_path, exp_path)
shutil.copy(vocab_file, exp_path)
```

I've made the change on my own machine and am testing (although it takes a while on my machine!) If it works, I'll make a pull request.
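As a self-contained check of the proposed fix (the temporary paths here are illustrative stand-ins for `vocab_file` and `exp_path` in finetune.py), `shutil.copy` behaves the same on Windows and Unix, whereas `os.system("cp ...")` relies on a Unix shell utility that Windows' cmd.exe does not provide:

```python
import shutil
import tempfile
from pathlib import Path

# Stand-ins for the real source/destination paths in finetune.py.
src_dir = Path(tempfile.mkdtemp())
exp_path = Path(tempfile.mkdtemp())

vocab_file = src_dir / "vocab.json"
vocab_file.write_text('{"tokens": []}')

# Portable replacement for: os.system(f"cp {vocab_file} {exp_path}")
shutil.copy(vocab_file, exp_path)

copied = exp_path / "vocab.json"
assert copied.read_text() == '{"tokens": []}'
```

`shutil.copy` also sidesteps a second bug in the `os.system` form: paths containing spaces would be split into multiple shell arguments.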

If it doesn't work, I'll try what you suggested.

johnbenac commented 8 months ago

I do have those files where you said they were (screenshot attached), and it should work if I manually move them over. But I'm going to let it run for a while and see if it works without the manual step.

erew123 commented 8 months ago

Hi @johnbenac

You're right, those do need changing, and I'm working on updating some documentation within the finetuning at the moment, so I may as well make the change. I've also tested using the 32-bit float and I can't see any performance drop or problem, so I'm happy to pull in your PR on that one (thanks for letting me know older cards had an issue with float16).

Re the vocab and config files: I forgot to mention that the variables generated in Step 2 for the locations of these files are stored within the browser window. So if you close the browser, refresh it, or close and restart finetuning, the variables will be wiped out, and you would have to manually point Step 3 at the files in any case.

During the actual training process, the vocab.json and config.json are not modified in any way; they are exactly the same as what comes with the base model, both before and after training. The only reason they were copied into the same folder was just for completeness, before I implemented the "Option A" and "Option B" buttons, as people didn't understand that you just end up with a model file at the end of training and that all the other files are exactly the same.

"Option A" and "Option B" buttons both just pull the files from the base model when they are copying the trained model into the trainedmodel or lastfinetuned folders.

So you will be fine to manually point Step 3 at the files if needed; there is no difference in those two files after the training.

Thanks

johnbenac commented 8 months ago

Try an experiment: start finetuning a model, but before finishing Step 2, step away from your computer for a few hours. Come back and see if Step 3 populates.

Maybe my Chrome puts the tab to sleep, maybe it's my ad-block Chrome extension, maybe it's Gradio... but I think that storing the variables in the browser, with Gradio, can be problematic. See if your browser does the job if you leave it alone for a while.

johnbenac commented 8 months ago

OK, it finally finished, and it copied over the vocab.json file just fine. But the browser still didn't populate everything like it should:

(screenshot)

So I closed down the Python program and opened it up again (after a while), and it pulled up a window that had everything populated as it should be (screenshot attached).

The model loaded successfully after a few minutes, and now I am working on step 4!

Something that made me think I couldn't enter the information manually is the downward-facing caret, the usual visual indicator of a dropdown menu, which suggested I couldn't type into that field. Maybe you could remove the caret when the user has to fill the field in manually, or find some other way to let the user know they can type a path into it.

Question: Does the audio sound better when you use 32 rather than 16? For all that extra time, you should be getting something better for it, after all!

erew123 commented 8 months ago

Hi @johnbenac

RE "Does the audio sound better when you use 32 rather than 16?" I assume you mean with the Whisper model on Step 1? It makes no difference to the output sound quality; it's purely to do with how the Whisper model is loaded/processed in RAM/VRAM, and has no resulting change on the audio files in any capacity.

RE the "downward-facing caret": they are indeed dropdowns. Typically they will auto-populate, and some people train multiple models at a time, hence the dropdowns to select between the models they have trained. Barring refreshing the browser in some way, they should auto-populate as you complete Step 2 and move to Step 3. You're actually the only person I've heard from who has mentioned any issue with that. I'll make a mental note should anyone else have an issue.

Assuming the main thrust of this ticket is resolved for now, I'm going to close it.

Thanks