IIEleven11 / StyleTTS2FineTune


Could you post your Configs/config_ft.yml? #2

Closed · Polarbear2121 closed this 11 months ago

IIEleven11 commented 11 months ago

Yeah sure. So this was on an Nvidia A6000. I made some of the more important parameters bold.

I was playing with a few other things trying to pinpoint an F0 problem; I forget exactly what I changed, though.

Also keep in mind this config was from a resumed fine-tuning run. You can see from this line that I changed the model to 219, which is where I was resuming from: "pretrained_model: Models/LJSpeech/epoch_2nd_00219.pth".

This won't work if you copy and paste it into your config because of this. Make sure you keep your "pretrained_model" as is until you need to resume.
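For reference, a minimal sketch of switching the config between a fresh fine-tune and a resume. This is a hypothetical helper, assuming PyYAML is installed; the checkpoint path is just the example from my config:

```python
# Hypothetical helper: repoint config_ft.yml at a checkpoint to resume from.
# Assumes PyYAML (pip install pyyaml); adjust paths to your own setup.
import yaml

with open("Configs/config_ft.yml") as f:
    cfg = yaml.safe_load(f)

# For a fresh fine-tune, leave pretrained_model pointing at the base model.
# To resume, point it at your own latest checkpoint instead, e.g.:
cfg["pretrained_model"] = "Models/LJSpeech/epoch_2nd_00219.pth"

with open("Configs/config_ft.yml", "w") as f:
    yaml.dump(cfg, f, sort_keys=False)
```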

```yaml
ASR_config: Utils/ASR/config.yml
ASR_path: Utils/ASR/epoch_00080.pth
F0_path: Utils/JDC/bst.t7
PLBERT_dir: Utils/PLBERT/
batch_size: 2
data_params:
  OOD_data: /home/Ubuntu/Documents/style/stylenoenvs/StyleTTS2/Data/OOD_texts.txt
  min_length: 50
  root_path: /home/Ubuntu/Documents/style/stylenoenvs/StyleTTS2/Data/wavs
  train_data: /home/Ubuntu/Documents/style/stylenoenvs/StyleTTS2/Data/train_list.txt
  val_data: /home/Ubuntu/Documents/style/stylenoenvs/StyleTTS2/Data/val_list.txt
device: cuda
epochs: 350
load_only_params: false
log_dir: Models/LJSpeech
log_interval: 10
loss_params:
  diff_epoch: 20
  joint_epoch: 30
  lambda_F0: 3.0
  lambda_ce: 20.0
  lambda_diff: 1.0
  lambda_dur: 1.0
  lambda_gen: 1.0
  lambda_mel: 2.0
  lambda_mono: 1.0
  lambda_norm: 1.0
  lambda_s2s: 1.0
  lambda_slm: 1.0
  lambda_sty: 1.0
max_len: 500
model_params:
  decoder:
    resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    resblock_kernel_sizes: [3, 7, 11]
    type: hifigan
    upsample_initial_channel: 512
    upsample_kernel_sizes: [20, 10, 6, 4]
    upsample_rates: [10, 5, 3, 2]
  diffusion:
    dist:
      estimate_sigma_data: true
      mean: -3.0
      sigma_data: 0.19319299498227843
      std: 1.0
    embedding_mask_proba: 0.1
    transformer:
      head_features: 64
      multiplier: 2
      num_heads: 8
      num_layers: 3
  dim_in: 64
  dropout: 0.2
  hidden_dim: 512
  max_conv_dim: 512
  max_dur: 50
  multispeaker: true
  n_layer: 3
  n_mels: 80
  n_token: 178
  slm:
    hidden: 768
    initial_channel: 64
    model: microsoft/wavlm-base-plus
    nlayers: 13
    sr: 16000
  style_dim: 128
optimizer_params:
  bert_lr: 1.0e-05
  ft_lr: 0.0001
  lr: 0.0001
preprocess_params:
  spect_params:
    hop_length: 300
    n_fft: 2048
    win_length: 1200
  sr: 24000
pretrained_model: Models/LJSpeech/epoch_2nd_00219.pth
save_freq: 10
second_stage_load_pretrained: true
slmadv_params:
  batch_percentage: 0.5
  iter: 10
  max_len: 400
  min_len: 300
  scale: 0.01
  sig: 1.5
  thresh: 5
```

Polarbear2121 commented 11 months ago

I ran the fine-tuning using Visual Studio Code (VSC) and this was the last output log:

```
Epoch [14/50], Step [570/1962], Loss: 0.29926, Disc Loss: 3.87392, Dur Loss: 0.39325, CE Loss: 0.01577, Norm Loss: 0.35677, F0 Loss: 1.75810, LM Loss: 1.10381, Gen Loss: 7.37969, Sty Loss: 0.11163, Diff Loss: 0.81120, DiscLM Loss: 0.00000, GenLM Loss: 0.00000, SLoss: 0.00000, S2S Loss: 0.05180, Mono Loss: 0.18753
```

But when I looked at the checkpoint, the file name is epoch_2nd_00009.pth, and it stopped recording six hours before I halted the code via the terminal window in VSC.

Also, something that confuses me: why do we need to download and extract the LJSpeech dataset if we are only fine-tuning? Did you extract the dataset, up-sample it to 24kHz, and then place it in log_dir: Models/LJSpeech?

Thank you!!!

IIEleven11 commented 11 months ago

> I ran the fine-tuning using Visual Studio Code (VSC) and this was the last output log:
>
> `Epoch [14/50], Step [570/1962], Loss: 0.29926, Disc Loss: 3.87392, Dur Loss: 0.39325, CE Loss: 0.01577, Norm Loss: 0.35677, F0 Loss: 1.75810, LM Loss: 1.10381, Gen Loss: 7.37969, Sty Loss: 0.11163, Diff Loss: 0.81120, DiscLM Loss: 0.00000, GenLM Loss: 0.00000, SLoss: 0.00000, S2S Loss: 0.05180, Mono Loss: 0.18753`
>
> But when I looked at the checkpoint, the file name is epoch_2nd_00009.pth, and it stopped recording six hours before I halted the code via the terminal window in VSC.
>
> Also, something that confuses me: why do we need to download and extract the LJSpeech dataset if we are only fine-tuning? Did you extract the dataset, up-sample it to 24kHz, and then place it in log_dir: Models/LJSpeech?
>
> Thank you!!!

It looks like it crashed for some reason; your terminal should've had a traceback. You probably have it set to save a checkpoint every 10th epoch (save_freq: 10). Because computers count from zero and not one, the save at the 10th epoch is named for epoch 9.
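To illustrate the numbering, a minimal sketch; it assumes the training script saves whenever `(epoch + 1) % save_freq == 0` and names files by the zero-based epoch index:

```python
# Sketch of the assumed checkpoint naming: epochs count from 0, so a save
# every 10th epoch lands on epoch index 9, 19, 29, ...
save_freq = 10

for epoch in range(30):                      # zero-based epoch counter
    if (epoch + 1) % save_freq == 0:         # every 10th epoch
        print(f"epoch_2nd_{epoch:05d}.pth")  # epoch_2nd_00009.pth, epoch_2nd_00019.pth, ...
```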

Uhm, you just need the OOD list (OOD_texts.txt) from the dataset. For simplicity I just included the download.

Correct, there is no reason to up-sample the LJSpeech dataset. I will clarify that in the README.
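Your own fine-tuning audio does need to match the config's preprocess_params.sr: 24000, though. A minimal resampling sketch, assuming librosa and soundfile are installed and that Data/wavs is the dataset folder from the config:

```python
# Hedged sketch: resample fine-tuning wavs to the 24 kHz expected by
# preprocess_params.sr in the config. Assumes librosa and soundfile are
# installed; LJSpeech itself does not need this step.
from pathlib import Path

import librosa
import soundfile as sf

for wav_path in Path("Data/wavs").glob("*.wav"):  # root_path from the config
    audio, sr = librosa.load(wav_path, sr=None)   # keep the native sample rate
    if sr != 24000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
        sf.write(wav_path, audio, 24000)          # overwrite in place at 24 kHz
```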