Closed: iamkhalidbashir closed this issue 1 year ago
Also @Edresson
Using /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth
and d_vector_file
as a string, with the typo fix suggested in PR #2234,
I am able to generate audio, but the quality is bad after 1 epoch.
I just wanted to check whether the model is restored correctly: even after 1 epoch I should not get a bad voice for already-trained voices, right?
I think it's because the model tts_models--en--vctk--vits
uses speaker IDs like p231, p232, etc.,
while my config uses VCTK_p231, VCTK_p232, etc.,
because of how this code formats the speaker names (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402
Is this the reason, or do I have to look somewhere else?
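If the prefix mismatch is the problem, one workaround is to remap the keys of the speaker-embedding mapping before training. A minimal, dependency-free sketch (the function name and the dummy embeddings are made up for illustration; a real d_vector file maps speaker IDs to embedding vectors):

```python
# Hypothetical helper: reconcile speaker-ID naming between a checkpoint
# that uses bare VCTK IDs ("p231") and a config whose formatter prefixes
# them ("VCTK_p231").

def with_vctk_prefix(d_vectors: dict, prefix: str = "VCTK_") -> dict:
    """Return a copy of the speaker->embedding mapping with prefixed keys.

    Keys that already carry the prefix are left unchanged, so the remap
    is safe to apply more than once.
    """
    return {
        (key if key.startswith(prefix) else prefix + key): value
        for key, value in d_vectors.items()
    }

# Dummy embeddings standing in for real d-vectors:
raw = {"p231": [0.1, 0.2], "p232": [0.3, 0.4]}
remapped = with_vctk_prefix(raw)
```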
The vocoder, text encoder, and speaker embedding approach are different between YourTTS and VITS. Given that, you are losing a lot of weights, so you will need many more epochs for the model to converge.
However, if you use the YourTTS checkpoint to do transfer learning, you will be able to get really great results in the first epoch: "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"
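For reference, passing that checkpoint to a Trainer-based recipe script looks roughly like this (the script path is hypothetical; `--restore_path` is the standard trainer argument, as used below):

```
python train_yourtts.py \
    --restore_path /root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth
```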
Describe the bug
Running this code with
restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth
Gives a log output of `Model restored from step 0`.
Full log:
> Training Environment: | > Current device: 0 | > Num. of GPUs: 1 | > Num. of CPUs: 16 | > Num. of Torch Threads: 24 | > Torch seed: 54321 | > Torch CUDNN: True | > Torch CUDNN deterministic: False | > Torch CUDNN benchmark: False > Restoring from model_file.pth ... > Restoring Model... > Partial model initialization... | > Layer missing in the model definition: speaker_encoder.conv1.weight | > Layer missing in the model definition: speaker_encoder.conv1.bias | > Layer missing in the model definition: speaker_encoder.bn1.weight | > Layer missing in the model definition: speaker_encoder.bn1.bias | > Layer missing in the model definition: speaker_encoder.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.bn1.running_var | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias | > 
Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked | > Layer missing 
in the model definition: speaker_encoder.layer1.2.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias | > Layer missing in the model 
definition: speaker_encoder.layer2.0.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth. > `speakers_file` is updated in the config.json. 
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight | > 
Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias | > Layer missing in the 
model definition: speaker_encoder.layer3.0.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean | > Layer missing in the model definition: 
speaker_encoder.layer3.1.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight | > Layer missing in the model definition: 
speaker_encoder.layer3.3.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean | > Layer missing in the model definition: 
speaker_encoder.layer3.4.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight | > Layer missing in the model definition: 
speaker_encoder.layer4.0.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean | > 
Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked | > 
Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb | > Layer missing in the model definition: speaker_encoder.attention.0.weight | > Layer missing in the model definition: speaker_encoder.attention.0.bias | > Layer missing in the model definition: speaker_encoder.attention.2.weight | > Layer missing in the model definition: speaker_encoder.attention.2.bias | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean | > Layer missing in the model definition: speaker_encoder.attention.2.running_var | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked | > Layer missing in the model definition: speaker_encoder.attention.3.weight | > Layer missing in the model definition: speaker_encoder.attention.3.bias | > Layer missing in the model definition: speaker_encoder.fc.weight | > Layer missing in the model definition: speaker_encoder.fc.bias | > Layer missing in the model definition: emb_l.weight | > Layer missing in the model definition: duration_predictor.cond_lang.weight | > Layer missing in the model definition: duration_predictor.cond_lang.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight 
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: 
text_encoder.encoder.attn_layers.1.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias | > Layer dimention missmatch between model definition and 
checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight | > Layer dimention missmatch between model 
definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias | > Layer dimention 
missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight | > 
Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta | > Layer 
dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: 
text_encoder.encoder.ffn_layers.0.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias | > Layer dimention missmatch between model definition and 
checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma | > Layer dimention missmatch between model definition and checkpoint: 
text_encoder.encoder.norm_layers_2.2.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight | > 724 / 896 layers are restored. > Model restored from step 0 > Model has 86565676 parameters
Also, when I run:
output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav
to test then I get
> Using model: vits > Setting up Audio Processor... | > sample_rate:16000 | > resample:False | > num_mels:80 | > log_func:np.log10 | > min_level_db:0 | > frame_shift_ms:None | > frame_length_ms:None | > ref_level_db:None | > fft_size:1024 | > power:None | > preemphasis:0.0 | > griffin_lim_iters:None | > signal_norm:None | > symmetric_norm:None | > mel_fmin:0 | > mel_fmax:None | > pitch_fmin:None | > pitch_fmax:None | > spec_gain:20.0 | > stft_pad_mode:reflect | > max_norm:1.0 | > clip_norm:True | > do_trim_silence:False | > trim_db:60 | > do_sound_norm:False | > do_amp_to_db_linear:True | > do_amp_to_db_mel:True | > do_rms_norm:False | > db_level:None | > stats_path:None | > base:10 | > hop_length:256 | > win_length:1024 > Model fully restored. > Setting up Audio Processor... | > sample_rate:16000 | > resample:False | > num_mels:64 | > log_func:np.log10 | > min_level_db:-100 | > frame_shift_ms:None | > frame_length_ms:None | > ref_level_db:20 | > fft_size:512 | > power:1.5 | > preemphasis:0.97 | > griffin_lim_iters:60 | > signal_norm:False | > symmetric_norm:False | > mel_fmin:0 | > mel_fmax:8000.0 | > pitch_fmin:1.0 | > pitch_fmax:640.0 | > spec_gain:20.0 | > stft_pad_mode:reflect | > max_norm:4.0 | > clip_norm:False | > do_trim_silence:False | > trim_db:60 | > do_sound_norm:False | > do_amp_to_db_linear:True | > do_amp_to_db_mel:True | > do_rms_norm:True | > db_level:-27.0 | > stats_path:None | > base:10 | > hop_length:160 | > win_length:400 > External Speaker Encoder Loaded !! > Model fully restored. > Setting up Audio Processor... 
| > sample_rate:16000 | > resample:False | > num_mels:64 | > log_func:np.log10 | > min_level_db:-100 | > frame_shift_ms:None | > frame_length_ms:None | > ref_level_db:20 | > fft_size:512 | > power:1.5 | > preemphasis:0.97 | > griffin_lim_iters:60 | > signal_norm:False | > symmetric_norm:False | > mel_fmin:0 | > mel_fmax:8000.0 | > pitch_fmin:1.0 | > pitch_fmax:640.0 | > spec_gain:20.0 | > stft_pad_mode:reflect | > max_norm:4.0 | > clip_norm:False | > do_trim_silence:False | > trim_db:60 | > do_sound_norm:False | > do_amp_to_db_linear:True | > do_amp_to_db_mel:True | > do_rms_norm:True | > db_level:-27.0 | > stats_path:None | > base:10 | > hop_length:160 | > win_length:400 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model. {}
Somehow the inference cannot read the speaker embeddings.
Here is my config:
{ "output_path": "/workspace/project/output", "logger_uri": null, "run_name": "YourTTS-EN-VCTK", "project_name": "YourTTS", "run_description": "\n - Original YourTTS trained using VCTK dataset\n ", "print_step": 50, "plot_step": 100, "model_param_stats": false, "wandb_entity": null, "dashboard_logger": "tensorboard", "log_model_step": 1000, "save_step": 500, "save_n_checkpoints": 2, "save_checkpoints": true, "save_all_best": false, "save_best_after": 10000, "target_loss": "loss_1", "print_eval": true, "test_delay_epochs": 0, "run_eval": true, "run_eval_steps": null, "distributed_backend": "nccl", "distributed_url": "tcp://localhost:54321", "mixed_precision": false, "epochs": 1, "batch_size": 18, "eval_batch_size": 18, "grad_clip": [ 1000, 1000 ], "scheduler_after_epoch": true, "lr": 0.001, "optimizer": "AdamW", "optimizer_params": { "betas": [ 0.8, 0.99 ], "eps": 1e-09, "weight_decay": 0.01 }, "lr_scheduler": null, "lr_scheduler_params": null, "use_grad_scaler": false, "cudnn_enable": true, "cudnn_deterministic": false, "cudnn_benchmark": false, "training_seed": 54321, "model": "vits", "num_loader_workers": 8, "num_eval_loader_workers": 4, "use_noise_augment": false, "audio": { "fft_size": 1024, "sample_rate": 16000, "win_length": 1024, "hop_length": 256, "num_mels": 80, "mel_fmin": 0.0, "mel_fmax": null }, "use_phonemes": false, "phonemizer": "espeak", "phoneme_language": "en", "compute_input_seq_cache": true, "text_cleaner": "multilingual_cleaners", "enable_eos_bos_chars": false, "test_sentences_file": "", "phoneme_cache_path": null, "characters": { "characters_class": "TTS.tts.models.vits.VitsCharacters", "vocab_dict": null, "pad": "_", "eos": "&", "bos": "*", "blank": null, "characters": 
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ", "punctuations": "!'(),-.:;? ", "phonemes": "", "is_unique": true, "is_sorted": true }, "add_blank": true, "batch_group_size": 5, "loss_masking": null, "min_audio_len": 1, "max_audio_len": 240000, "min_text_len": 1, "max_text_len": Infinity, "compute_f0": false, "compute_linear_spec": true, "precompute_num_workers": 12, "start_by_longest": true, "shuffle": false, "drop_last": false, "datasets": [ { "formatter": "vctk", "dataset_name": "vctk", "path": "/workspace/project/VCTK", "meta_file_train": "", "ignored_speakers": null, "language": "en", "meta_file_val": "", "meta_file_attn_mask": "" } ], "test_sentences": [ [ "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.", "VCTK_p277", null, "en" ], [ "Be a voice, not an echo.", "VCTK_p239", null, "en" ], [ "I'm sorry Dave. I'm afraid I can't do that.", "VCTK_p258", null, "en" ], [ "This cake is great. 
It's so delicious and moist.", "VCTK_p244", null, "en" ], [ "Prior to November 22, 1963.", "VCTK_p305", null, "en" ] ], "eval_split_max_size": 256, "eval_split_size": 0.01, "use_speaker_weighted_sampler": false, "speaker_weighted_sampler_alpha": 1.0, "use_language_weighted_sampler": false, "language_weighted_sampler_alpha": 1.0, "use_length_weighted_sampler": false, "length_weighted_sampler_alpha": 1.0, "model_args": { "num_chars": 165, "out_channels": 513, "spec_segment_size": 32, "hidden_channels": 192, "hidden_channels_ffn_text_encoder": 768, "num_heads_text_encoder": 2, "num_layers_text_encoder": 10, "kernel_size_text_encoder": 3, "dropout_p_text_encoder": 0.1, "dropout_p_duration_predictor": 0.5, "kernel_size_posterior_encoder": 5, "dilation_rate_posterior_encoder": 1, "num_layers_posterior_encoder": 16, "kernel_size_flow": 5, "dilation_rate_flow": 1, "num_layers_flow": 4, "resblock_type_decoder": "2", "resblock_kernel_sizes_decoder": [ 3, 7, 11 ], "resblock_dilation_sizes_decoder": [ [ 1, 3, 5 ], [ 1, 3, 5 ], [ 1, 3, 5 ] ], "upsample_rates_decoder": [ 8, 8, 2, 2 ], "upsample_initial_channel_decoder": 512, "upsample_kernel_sizes_decoder": [ 16, 16, 4, 4 ], "periods_multi_period_discriminator": [ 2, 3, 5, 7, 11 ], "use_sdp": true, "noise_scale": 1.0, "inference_noise_scale": 0.667, "length_scale": 1, "noise_scale_dp": 1.0, "inference_noise_scale_dp": 1.0, "max_inference_len": null, "init_discriminator": true, "use_spectral_norm_disriminator": false, "use_speaker_embedding": false, "num_speakers": 0, "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth", "d_vector_file": [ "/workspace/project/VCTK/speakers.pth" ], "speaker_embedding_channels": 256, "use_d_vector_file": true, "d_vector_dim": 512, "detach_dp_input": true, "use_language_embedding": false, "embedded_language_dim": 4, "num_languages": 0, "language_ids_file": null, "use_speaker_encoder_as_loss": true, "speaker_encoder_config_path": 
"https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json", "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar", "condition_dp_on_speaker": true, "freeze_encoder": false, "freeze_DP": false, "freeze_PE": false, "freeze_flow_decoder": false, "freeze_waveform_decoder": false, "encoder_sample_rate": null, "interpolate_z": true, "reinit_DP": false, "reinit_text_encoder": false }, "lr_gen": 0.0002, "lr_disc": 0.0002, "lr_scheduler_gen": "ExponentialLR", "lr_scheduler_gen_params": { "gamma": 0.999875, "last_epoch": -1 }, "lr_scheduler_disc": "ExponentialLR", "lr_scheduler_disc_params": { "gamma": 0.999875, "last_epoch": -1 }, "kl_loss_alpha": 1.0, "disc_loss_alpha": 1.0, "gen_loss_alpha": 1.0, "feat_loss_alpha": 1.0, "mel_loss_alpha": 45.0, "dur_loss_alpha": 1.0, "speaker_encoder_loss_alpha": 9.0, "return_wav": true, "use_weighted_sampler": false, "weighted_sampler_attrs": null, "weighted_sampler_multipliers": null, "r": 1, "num_speakers": 0, "use_speaker_embedding": false, "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth", "speaker_embedding_channels": 256, "language_ids_file": null, "use_language_embedding": false, "use_d_vector_file": true, "d_vector_file": [ "/workspace/project/VCTK/speakers.pth" ], "d_vector_dim": 512 }
It might be because of a typo on line #114, where it should be `speakers_file` instead of `speaker_file`?

Also, after disabling `model_args.use_d_vector_file` and enabling `model_args.use_speaker_embedding`
I get this error:-> Using model: vits > Setting up Audio Processor... | > sample_rate:16000 | > resample:False | > num_mels:80 | > log_func:np.log10 | > min_level_db:0 | > frame_shift_ms:None | > frame_length_ms:None | > ref_level_db:None | > fft_size:1024 | > power:None | > preemphasis:0.0 | > griffin_lim_iters:None | > signal_norm:None | > symmetric_norm:None | > mel_fmin:0 | > mel_fmax:None | > pitch_fmin:None | > pitch_fmax:None | > spec_gain:20.0 | > stft_pad_mode:reflect | > max_norm:1.0 | > clip_norm:True | > do_trim_silence:False | > trim_db:60 | > do_sound_norm:False | > do_amp_to_db_linear:True | > do_amp_to_db_mel:True | > do_rms_norm:False | > db_level:None | > stats_path:None | > base:10 | > hop_length:256 | > win_length:1024 > Model fully restored. > Setting up Audio Processor... | > sample_rate:16000 | > resample:False | > num_mels:64 | > log_func:np.log10 | > min_level_db:-100 | > frame_shift_ms:None | > frame_length_ms:None | > ref_level_db:20 | > fft_size:512 | > power:1.5 | > preemphasis:0.97 | > griffin_lim_iters:60 | > signal_norm:False | > symmetric_norm:False | > mel_fmin:0 | > mel_fmax:8000.0 | > pitch_fmin:1.0 | > pitch_fmax:640.0 | > spec_gain:20.0 | > stft_pad_mode:reflect | > max_norm:4.0 | > clip_norm:False | > do_trim_silence:False | > trim_db:60 | > do_sound_norm:False | > do_amp_to_db_linear:True | > do_amp_to_db_mel:True | > do_rms_norm:True | > db_level:-27.0 | > stats_path:None | > base:10 | > hop_length:160 | > win_length:400 > initialization of speaker-embedding layers. > External Speaker Encoder Loaded !! 
Traceback (most recent call last): File "/opt/conda/bin/tts", line 8, in <module> sys.exit(main()) File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main args.use_cuda, File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__ self._load_tts(tts_checkpoint, tts_config_path, use_cuda) File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True) File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape: KeyError: 'emb_g.weight'
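The traceback suggests `load_checkpoint` assumes `emb_g.weight` exists in the checkpoint whenever the model defines `emb_g`. A hedged sketch of a guard that would avoid the `KeyError` (my naming and structure, not the actual fix in vits.py; shapes are plain tuples for illustration):

```python
# Hedged sketch (my naming, not the actual fix): only compare emb_g shapes
# when the checkpoint really contains "emb_g.weight". Checkpoints trained with
# use_d_vector_file=True have no speaker-embedding table, which is what makes
# the load raise KeyError: 'emb_g.weight' here.
def emb_g_mismatch(ckpt_model_state, model_emb_g_shape):
    """True only when the checkpoint has an emb_g table of a different shape."""
    weight_shape = ckpt_model_state.get("emb_g.weight")
    return weight_shape is not None and tuple(weight_shape) != tuple(model_emb_g_shape)

print(emb_g_mismatch({}, (108, 512)))                            # False: key absent, no KeyError
print(emb_g_mismatch({"emb_g.weight": (97, 512)}, (108, 512)))   # True: table sizes differ
```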
Also, when restoring from /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth for 22kHz sample files I do get "Model restored from step 1000000", but the rest of the inference errors are the same.
Guys @erogol @Edresson, am I doing something wrong, or should I create an issue?
To Reproduce
Run train_yourtts with default params
Expected behavior
No response
Logs
No response
Environment
Nvidia 3090
Additional context
No response
The issue is the type of d_vector_file inside VitsArgs and VitsConfig. It should be `List[str]` and not `str`. We are working on the proper fix in #2234.
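Until that fix lands, a minimal workaround sketch (the helper name is mine, not part of the library) is to coerce a bare string into the expected `List[str]` before building the config:

```python
# Hedged workaround sketch: VitsArgs/VitsConfig expect d_vector_file to be a
# List[str], but older configs pass a bare str. Coerce before building the
# config so both forms work.
def normalize_d_vector_file(value):
    """Return d_vector_file as a list of paths, accepting str, list, or None."""
    if value is None:
        return None
    if isinstance(value, str):
        return [value]
    return list(value)

print(normalize_d_vector_file("/workspace/project/VCTK/speakers.pth"))
```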
Also @Edresson, using `/root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth` and `d_vector_file` as a string, with the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad with 1 epoch. I just wanted to check whether the model is restored correctly, so even with 1 epoch I should not get a bad voice for already-trained voices, right?

I think it's because the model `tts_models--en--vctk--vits` uses speaker ids as p231, p232, etc., while my config uses them as VCTK_p231, VCTK_p232, etc., because of how this code is formatted (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402

Is this the reason, or do I have to look somewhere else?
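The mismatch can be reproduced with a toy version of the prefixing applied by the `vctk` formatter (the helper below is mine; the `VCTK_` prefix convention is taken from the linked formatters.py):

```python
# Toy illustration (my helper, not library code) of the speaker-name mismatch:
# the vctk formatter prefixes raw speaker ids with "VCTK_", so a checkpoint
# trained with plain "p231"-style ids cannot match a config that saw "VCTK_p231".
def format_speaker_name(dataset_prefix, speaker_id):
    # mirrors the "<PREFIX>_<speaker>" convention in TTS/tts/datasets/formatters.py
    return f"{dataset_prefix}_{speaker_id}"

checkpoint_speakers = {"p231", "p232"}
config_speakers = {format_speaker_name("VCTK", s) for s in checkpoint_speakers}
print(sorted(config_speakers))                 # ['VCTK_p231', 'VCTK_p232']
print(config_speakers & checkpoint_speakers)   # empty set: no names overlap
```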
The vocoder, text encoder, and speaker embedding approach are different between YourTTS and VITS. Given that you are losing a lot of weights, you will need a lot more epochs for the model to converge.
However, if you use the YourTTS checkpoint to do transfer learning you will be able to get really great results in the first epoch: "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"
Got it! So should I ignore this message (as mentioned in the logs above):
| > 724 / 896 layers are restored.
> Model restored from step 0
This happens when I use "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth" as the restore path.
Yes, the following weights will not be loaded: "speaker_encoder." (because we changed this part and it is now in speaker_manager), "emb_l.weight" (because it is not a multilingual training), and "duration_predictor.cond_lang." (because it is not a multilingual training).
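A sketch of the filtering that this implies (names, shapes, and structure below are illustrative assumptions, not the trainer's actual code): a checkpoint tensor is restorable only if it is outside the skipped modules and its shape matches the model definition.

```python
# Hedged sketch of partial checkpoint restoration: skip the modules listed
# above and any key whose shape disagrees with the model definition, which is
# why only 724 / 896 layers load when restoring YourTTS into this config.
SKIP_PREFIXES = ("speaker_encoder.", "emb_l.", "duration_predictor.cond_lang.")

def restorable(model_shapes, ckpt_shapes, skip_prefixes=SKIP_PREFIXES):
    """Return the checkpoint keys that can safely be copied into the model."""
    return [
        key for key, shape in ckpt_shapes.items()
        if not key.startswith(skip_prefixes) and model_shapes.get(key) == shape
    ]

model = {"text_encoder.emb.weight": (165, 192), "emb_l.weight": (4, 4)}
ckpt = {"text_encoder.emb.weight": (165, 192),   # shape matches: restored
        "emb_l.weight": (8, 4),                  # skipped prefix: dropped
        "speaker_encoder.fc.weight": (512, 256)} # skipped prefix: dropped
print(restorable(model, ckpt))  # ['text_encoder.emb.weight']
```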
Thanks @Edresson
Just a quick concern I wanted your input on: my `speakers.pth` has the following when `--list_speaker_idxs` is used:
{'VCTK_p225': 0, 'VCTK_p226': 1, 'VCTK_p227': 2, 'VCTK_p228': 3, 'VCTK_p229': 4, 'VCTK_p230': 5, 'VCTK_p231': 6, 'VCTK_p232': 7, 'VCTK_p233': 8, 'VCTK_p234': 9, 'VCTK_p236': 10, 'VCTK_p237': 11, 'VCTK_p238': 12, 'VCTK_p239': 13, 'VCTK_p240': 14, 'VCTK_p241': 15, 'VCTK_p243': 16, 'VCTK_p244': 17, 'VCTK_p245': 18, 'VCTK_p246': 19, 'VCTK_p247': 20, 'VCTK_p248': 21, 'VCTK_p249': 22, 'VCTK_p250': 23, 'VCTK_p251': 24, 'VCTK_p252': 25, 'VCTK_p253': 26, 'VCTK_p254': 27, 'VCTK_p255': 28, 'VCTK_p256': 29, 'VCTK_p257': 30, 'VCTK_p258': 31, 'VCTK_p259': 32, 'VCTK_p260': 33, 'VCTK_p261': 34, 'VCTK_p262': 35, 'VCTK_p263': 36, 'VCTK_p264': 37, 'VCTK_p265': 38, 'VCTK_p266': 39, 'VCTK_p267': 40, 'VCTK_p268': 41, 'VCTK_p269': 42, 'VCTK_p270': 43, 'VCTK_p271': 44, 'VCTK_p272': 45, 'VCTK_p273': 46, 'VCTK_p274': 47, 'VCTK_p275': 48, 'VCTK_p276': 49, 'VCTK_p277': 50, 'VCTK_p278': 51, 'VCTK_p279': 52, 'VCTK_p280': 53, 'VCTK_p281': 54, 'VCTK_p282': 55, 'VCTK_p283': 56, 'VCTK_p284': 57, 'VCTK_p285': 58, 'VCTK_p286': 59, 'VCTK_p287': 60, 'VCTK_p288': 61, 'VCTK_p292': 62, 'VCTK_p293': 63, 'VCTK_p294': 64, 'VCTK_p295': 65, 'VCTK_p297': 66, 'VCTK_p298': 67, 'VCTK_p299': 68, 'VCTK_p300': 69, 'VCTK_p301': 70, 'VCTK_p302': 71, 'VCTK_p303': 72, 'VCTK_p304': 73, 'VCTK_p305': 74, 'VCTK_p306': 75, 'VCTK_p307': 76, 'VCTK_p308': 77, 'VCTK_p310': 78, 'VCTK_p311': 79, 'VCTK_p312': 80, 'VCTK_p313': 81, 'VCTK_p314': 82, 'VCTK_p316': 83, 'VCTK_p317': 84, 'VCTK_p318': 85, 'VCTK_p323': 86, 'VCTK_p326': 87, 'VCTK_p329': 88, 'VCTK_p330': 89, 'VCTK_p333': 90, 'VCTK_p334': 91, 'VCTK_p335': 92, 'VCTK_p336': 93, 'VCTK_p339': 94, 'VCTK_p340': 95, 'VCTK_p341': 96, 'VCTK_p343': 97, 'VCTK_p345': 98, 'VCTK_p347': 99, 'VCTK_p351': 100, 'VCTK_p360': 101, 'VCTK_p361': 102, 'VCTK_p363': 103, 'VCTK_p364': 104, 'VCTK_p374': 105, 'VCTK_p376': 106, 'VCTK_s5': 107, 'VCTK_old_new_voice': 0}
As you can see, the voices VCTK_p225 and VCTK_old_new_voice (my new voice, loaded with the formatter vctk_old) both have id 0 after I pass my new voice in DATASETS_CONFIG_LIST. Is this a problem?
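If one wanted to guarantee unique ids regardless of how datasets are merged, a simple re-indexing sketch (a hypothetical helper of mine, not the library's code) would be:

```python
# Hedged sketch: assign sequential unique ids across all speaker names, so a
# speaker coming from a second dataset cannot collide with id 0 of the first.
def reindex_speakers(*speaker_dicts):
    """Merge speaker-name dicts, reassigning ids in first-seen order."""
    merged = {}
    for speakers in speaker_dicts:
        for name in speakers:
            if name not in merged:
                merged[name] = len(merged)
    return merged

ids = reindex_speakers({"VCTK_p225": 0, "VCTK_p226": 1}, {"VCTK_old_new_voice": 0})
print(ids)  # {'VCTK_p225': 0, 'VCTK_p226': 1, 'VCTK_old_new_voice': 2}
```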
Also, it looks like my new voice (VCTK_old_new_voice) and all the VCTK voices are getting better even after 4k steps, but they sound non-English (another language perhaps?): https://voca.ro/1lwuJZycpSl7. Although I have set everything to `en`.
Thanks @Edresson Just a quick concern I wanted to ask you input:- My
speakers.pth
has following when--list_speaker_idxs
is used:-{'VCTK_p225': 0, 'VCTK_p226': 1, 'VCTK_p227': 2, 'VCTK_p228': 3, 'VCTK_p229': 4, 'VCTK_p230': 5, 'VCTK_p231': 6, 'VCTK_p232': 7, 'VCTK_p233': 8, 'VCTK_p234': 9, 'VCTK_p236': 10, 'VCTK_p237': 11, 'VCTK_p238': 12, 'VCTK_p239': 13, 'VCTK_p240': 14, 'VCTK_p241': 15, 'VCTK_p243': 16, 'VCTK_p244': 17, 'VCTK_p245': 18, 'VCTK_p246': 19, 'VCTK_p247': 20, 'VCTK_p248': 21, 'VCTK_p249': 22, 'VCTK_p250': 23, 'VCTK_p251': 24, 'VCTK_p252': 25, 'VCTK_p253': 26, 'VCTK_p254': 27, 'VCTK_p255': 28, 'VCTK_p256': 29, 'VCTK_p257': 30, 'VCTK_p258': 31, 'VCTK_p259': 32, 'VCTK_p260': 33, 'VCTK_p261': 34, 'VCTK_p262': 35, 'VCTK_p263': 36, 'VCTK_p264': 37, 'VCTK_p265': 38, 'VCTK_p266': 39, 'VCTK_p267': 40, 'VCTK_p268': 41, 'VCTK_p269': 42, 'VCTK_p270': 43, 'VCTK_p271': 44, 'VCTK_p272': 45, 'VCTK_p273': 46, 'VCTK_p274': 47, 'VCTK_p275': 48, 'VCTK_p276': 49, 'VCTK_p277': 50, 'VCTK_p278': 51, 'VCTK_p279': 52, 'VCTK_p280': 53, 'VCTK_p281': 54, 'VCTK_p282': 55, 'VCTK_p283': 56, 'VCTK_p284': 57, 'VCTK_p285': 58, 'VCTK_p286': 59, 'VCTK_p287': 60, 'VCTK_p288': 61, 'VCTK_p292': 62, 'VCTK_p293': 63, 'VCTK_p294': 64, 'VCTK_p295': 65, 'VCTK_p297': 66, 'VCTK_p298': 67, 'VCTK_p299': 68, 'VCTK_p300': 69, 'VCTK_p301': 70, 'VCTK_p302': 71, 'VCTK_p303': 72, 'VCTK_p304': 73, 'VCTK_p305': 74, 'VCTK_p306': 75, 'VCTK_p307': 76, 'VCTK_p308': 77, 'VCTK_p310': 78, 'VCTK_p311': 79, 'VCTK_p312': 80, 'VCTK_p313': 81, 'VCTK_p314': 82, 'VCTK_p316': 83, 'VCTK_p317': 84, 'VCTK_p318': 85, 'VCTK_p323': 86, 'VCTK_p326': 87, 'VCTK_p329': 88, 'VCTK_p330': 89, 'VCTK_p333': 90, 'VCTK_p334': 91, 'VCTK_p335': 92, 'VCTK_p336': 93, 'VCTK_p339': 94, 'VCTK_p340': 95, 'VCTK_p341': 96, 'VCTK_p343': 97, 'VCTK_p345': 98, 'VCTK_p347': 99, 'VCTK_p351': 100, 'VCTK_p360': 101, 'VCTK_p361': 102, 'VCTK_p363': 103, 'VCTK_p364': 104, 'VCTK_p374': 105, 'VCTK_p376': 106, 'VCTK_s5': 107, 'VCTK_old_new_voice': 0}
As you can see, both VCTK_p225 and VCTK_old_new_voice (my new voice, loaded with the vctk_old formatter) have id 0 after I pass my new voice in the DATASETS_CONFIG_LIST. Is this a problem?
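For what it's worth, a quick way to spot collisions like this is to group the mapping by id and flag any id claimed by more than one name. This is just a hand-rolled sanity check, not part of the TTS library:

```python
from collections import defaultdict

def find_duplicate_ids(speaker_idxs):
    """Group speaker names by their assigned id and return ids used more than once.

    `speaker_idxs` is the name -> id mapping printed by --list_speaker_idxs.
    """
    by_id = defaultdict(list)
    for name, idx in speaker_idxs.items():
        by_id[idx].append(name)
    return {idx: names for idx, names in by_id.items() if len(names) > 1}

# Small excerpt of the mapping above: two names share id 0.
idxs = {"VCTK_p225": 0, "VCTK_p226": 1, "VCTK_old_new_voice": 0}
print(find_duplicate_ids(idxs))  # {0: ['VCTK_p225', 'VCTK_old_new_voice']}
```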
Also, it looks like my new voice (VCTK_old_new_voice) and all the VCTK voices are getting better even after 4k steps, but they sound non-English (in another language maybe?): https://voca.ro/1lwuJZycpSl7 although I have set everything to en.
It should not affect the training or inference because we use the speaker name and not the ids. But yeah, it is weird and can cause confusion; I fixed it in https://github.com/coqui-ai/TTS/pull/2234/commits/c8245cde075911a2137d7963feff2abbf48d5d07
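Conceptually the fix amounts to numbering the unique speaker names once across all datasets instead of restarting the counter per dataset. A minimal sketch of that idea (not the actual library code):

```python
def make_speaker_ids(speaker_names):
    """Assign one sequential id per unique speaker name, preserving first-seen order."""
    unique = dict.fromkeys(speaker_names)  # ordered de-duplication
    return {name: i for i, name in enumerate(unique)}

# Names collected across two datasets; the counter is never reset,
# so no two distinct names can end up with the same id.
names = ["VCTK_p225", "VCTK_p226", "VCTK_old_new_voice"]
print(make_speaker_ids(names))  # {'VCTK_p225': 0, 'VCTK_p226': 1, 'VCTK_old_new_voice': 2}
```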
Amazing! Quick question:
Q1: It looks like my new voice (VCTK_old_new_voice) and all the VCTK voices are getting better even after 4k steps, but they sound non-English (in another language maybe?): https://voca.ro/1lwuJZycpSl7 although I have set everything to en
Q2: To fine-tune the new voice, do I really need the VCTK dataset, or is the restored model path enough? And how many steps should I expect for a good-quality fine-tune of my voice?
Thanks
For example, this (https://voca.ro/168hwUXVT4wy) sounds like someone speaking in another language, although I have set en as the language for my dataset.
My config:
{
"output_path": "/workspace/project/output",
"logger_uri": null,
"run_name": "YourTTS-EN-VCTK",
"project_name": "YourTTS",
"run_description": "\n - Original YourTTS trained using VCTK dataset\n ",
"print_step": 50,
"plot_step": 100,
"model_param_stats": false,
"wandb_entity": null,
"dashboard_logger": "tensorboard",
"log_model_step": 1000,
"save_step": 500,
"save_n_checkpoints": 2,
"save_checkpoints": true,
"save_all_best": false,
"save_best_after": 10000,
"target_loss": "loss_1",
"print_eval": true,
"test_delay_epochs": 0,
"run_eval": true,
"run_eval_steps": null,
"distributed_backend": "nccl",
"distributed_url": "tcp://localhost:54321",
"mixed_precision": false,
"epochs": 10,
"batch_size": 24,
"eval_batch_size": 24,
"grad_clip": [
1000.0,
1000.0
],
"scheduler_after_epoch": true,
"lr": 0.001,
"optimizer": "AdamW",
"optimizer_params": {
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"weight_decay": 0.01
},
"lr_scheduler": null,
"lr_scheduler_params": null,
"use_grad_scaler": false,
"cudnn_enable": true,
"cudnn_deterministic": false,
"cudnn_benchmark": false,
"training_seed": 54321,
"model": "vits",
"num_loader_workers": 8,
"num_eval_loader_workers": 4,
"use_noise_augment": false,
"audio": {
"fft_size": 1024,
"sample_rate": 16000,
"win_length": 1024,
"hop_length": 256,
"num_mels": 80,
"mel_fmin": 0,
"mel_fmax": null
},
"use_phonemes": false,
"phonemizer": "espeak",
"phoneme_language": "en",
"compute_input_seq_cache": true,
"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"test_sentences_file": "",
"phoneme_cache_path": null,
"characters": {
"characters_class": "TTS.tts.models.vits.VitsCharacters",
"vocab_dict": null,
"pad": "_",
"eos": "&",
"bos": "*",
"blank": null,
"characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
"punctuations": "!'(),-.:;? ",
"phonemes": "",
"is_unique": true,
"is_sorted": true
},
"add_blank": true,
"batch_group_size": 5,
"loss_masking": null,
"min_audio_len": 1,
"max_audio_len": 160000,
"min_text_len": 1,
"max_text_len": Infinity,
"compute_f0": false,
"compute_linear_spec": true,
"precompute_num_workers": 12,
"start_by_longest": true,
"shuffle": false,
"drop_last": false,
"datasets": [
{
"formatter": "vctk",
"dataset_name": "vctk",
"path": "/workspace/project/VCTK",
"meta_file_train": "",
"ignored_speakers": [
"p261",
"p225",
"p294",
"p347",
"p238",
"p234",
"p248",
"p335",
"p245",
"p326",
"p302"
],
"language": "en",
"meta_file_val": "",
"meta_file_attn_mask": ""
},
{
"formatter": "vctk_old",
"dataset_name": "newspeaker",
"path": "/workspace/project/datasets/madaliene",
"meta_file_train": "",
"ignored_speakers": null,
"language": "en",
"meta_file_val": "",
"meta_file_attn_mask": ""
}
],
"test_sentences": [
[
"It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
"VCTK_old_new_voice",
null,
"en"
],
[
"Be a voice, not an echo.",
"VCTK_p239",
null,
"en"
],
[
"I'm sorry Dave. I'm afraid I can't do that.",
"VCTK_old_new_voice",
null,
"en"
],
[
"This cake is great. It's so delicious and moist.",
"VCTK_p244",
null,
"en"
],
[
"Prior to November 22, 1963.",
"VCTK_p305",
null,
"en"
]
],
"eval_split_max_size": 256,
"eval_split_size": 0.01,
"use_speaker_weighted_sampler": false,
"speaker_weighted_sampler_alpha": 1.0,
"use_language_weighted_sampler": false,
"language_weighted_sampler_alpha": 1.0,
"use_length_weighted_sampler": false,
"length_weighted_sampler_alpha": 1.0,
"model_args": {
"num_chars": 165,
"out_channels": 513,
"spec_segment_size": 62,
"hidden_channels": 192,
"hidden_channels_ffn_text_encoder": 768,
"num_heads_text_encoder": 2,
"num_layers_text_encoder": 10,
"kernel_size_text_encoder": 3,
"dropout_p_text_encoder": 0.1,
"dropout_p_duration_predictor": 0.5,
"kernel_size_posterior_encoder": 5,
"dilation_rate_posterior_encoder": 1,
"num_layers_posterior_encoder": 16,
"kernel_size_flow": 5,
"dilation_rate_flow": 1,
"num_layers_flow": 4,
"resblock_type_decoder": "2",
"resblock_kernel_sizes_decoder": [
3,
7,
11
],
"resblock_dilation_sizes_decoder": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates_decoder": [
8,
8,
2,
2
],
"upsample_initial_channel_decoder": 512,
"upsample_kernel_sizes_decoder": [
16,
16,
4,
4
],
"periods_multi_period_discriminator": [
2,
3,
5,
7,
11
],
"use_sdp": true,
"noise_scale": 1.0,
"inference_noise_scale": 0.667,
"length_scale": 1.0,
"noise_scale_dp": 1.0,
"inference_noise_scale_dp": 1.0,
"max_inference_len": null,
"init_discriminator": true,
"use_spectral_norm_disriminator": false,
"use_speaker_embedding": false,
"num_speakers": 0,
"speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-23-2022_10+35AM-c8245cde/speakers.pth",
"d_vector_file": [
"/workspace/project/VCTK/speakers.pth",
"/workspace/project/datasets/madaliene/speakers.pth"
],
"speaker_embedding_channels": 512,
"use_d_vector_file": true,
"d_vector_dim": 512,
"detach_dp_input": true,
"use_language_embedding": false,
"embedded_language_dim": 4,
"num_languages": 0,
"language_ids_file": null,
"use_speaker_encoder_as_loss": false,
"speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
"speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
"condition_dp_on_speaker": true,
"freeze_encoder": false,
"freeze_DP": false,
"freeze_PE": false,
"freeze_flow_decoder": false,
"freeze_waveform_decoder": false,
"encoder_sample_rate": null,
"interpolate_z": true,
"reinit_DP": false,
"reinit_text_encoder": false
},
"lr_gen": 0.0002,
"lr_disc": 0.0002,
"lr_scheduler_gen": "ExponentialLR",
"lr_scheduler_gen_params": {
"gamma": 0.999875,
"last_epoch": -1
},
"lr_scheduler_disc": "ExponentialLR",
"lr_scheduler_disc_params": {
"gamma": 0.999875,
"last_epoch": -1
},
"kl_loss_alpha": 1.0,
"disc_loss_alpha": 1.0,
"gen_loss_alpha": 1.0,
"feat_loss_alpha": 1.0,
"mel_loss_alpha": 45.0,
"dur_loss_alpha": 1.0,
"speaker_encoder_loss_alpha": 9.0,
"return_wav": true,
"use_weighted_sampler": false,
"weighted_sampler_attrs": {
"speaker_name": 1.0
},
"weighted_sampler_multipliers": {
"speaker_name": {}
},
"r": 1,
"num_speakers": 0,
"use_speaker_embedding": false,
"speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-23-2022_10+35AM-c8245cde/speakers.pth",
"speaker_embedding_channels": 512,
"language_ids_file": null,
"use_language_embedding": false,
"use_d_vector_file": true,
"d_vector_file": [
"/workspace/project/VCTK/speakers.pth",
"/workspace/project/datasets/madaliene/speakers.pth"
],
"d_vector_dim": 512
}
Also @Edresson @erogol, I think there is a bug/misconfiguration in these lines as well: https://github.com/coqui-ai/TTS/blob/0910cb76bcd85df56bf43654bb31427647cdfd0d/recipes/vctk/yourtts/train_yourtts.py#L206-L209
Error:
> EPOCH: 0/10
--> /content/project/output/YourTTS-EN-VCTK-December-26-2022_08+20AM-c8245cde
> DataLoader initialization
| > Tokenizer:
| > add_blank: True
| > use_eos_bos: False
| > use_phonemes: False
| > Number of instances : 39959
! Run is kept in /content/project/output/YourTTS-EN-VCTK-December-26-2022_08+20AM-c8245cde
| > Preprocessing samples
| > Max text length: 388
| > Min text length: 10
| > Avg text length: 40.86955977676002
|
| > Max audio length: 159781.0
| > Min audio length: 7698.0
| > Avg audio length: 24937.56375603774
| > Num. instances discarded samples: 2
| > Batch group size: 50.
> Using weighted sampler for attribute 'speaker_name' with alpha '1.0'
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1591, in fit
self._fit()
File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1544, in _fit
self.train_epoch()
File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1292, in train_epoch
self.train_loader = self.get_train_dataloader(
File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 803, in get_train_dataloader
return self._get_loader(
File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 767, in _get_loader
loader = model.get_data_loader(
File "/content/project/TTS/TTS/tts/models/vits.py", line 1621, in get_data_loader
sampler = self.get_sampler(config, dataset, num_gpus)
File "/content/project/TTS/TTS/tts/models/vits.py", line 1554, in get_sampler
multi_dict = config.weighted_sampler_multipliers.get(attr_name, None)
AttributeError: 'NoneType' object has no attribute 'get'
An exception has occurred, use %tb to see the full traceback.
SystemExit: 1
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3334: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
The training is failing because I think `weighted_sampler_multipliers` must be supplied!
> weighted_sampler_multipliers

On my side the training works well. `weighted_sampler_multipliers` is supplied as an empty dict by default, so it should work.
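The `AttributeError: 'NoneType' object has no attribute 'get'` in the traceback is exactly what happens when the field is explicitly set to null in a hand-written config instead of being left at its empty-dict default. A defensive lookup like the following would sidestep it; this is a sketch of the idea, not the upstream trainer code:

```python
class Config:
    # Simulates a config where the field was explicitly set to null/None
    # instead of being left at its empty-dict default.
    weighted_sampler_multipliers = None

def get_multiplier_dict(config, attr_name):
    """Fall back to an empty dict when the multipliers mapping is missing or None."""
    multipliers = getattr(config, "weighted_sampler_multipliers", None) or {}
    return multipliers.get(attr_name, {})

print(get_multiplier_dict(Config(), "speaker_name"))  # {}
```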
> Q1: It looks like my new voice (VCTK_old_new_voice) and all the VCTK voices are getting better even after 4k steps, but they sound non-English (in another language maybe?): https://voca.ro/1lwuJZycpSl7 although I have set everything to en
> Q2: To fine-tune the new voice, do I really need the VCTK dataset, or is the restored model path enough? And how many steps should I expect for a good-quality fine-tune of my voice?

Could you please take a look at this?
Q1: Not sure what is happening. I fine-tuned the model with the recipe for 3 epochs and the voices sound great. Try to run the recipe as it is in PR #2234 (without your dataset).
Q2: It depends on how much data you have. I recommend the original training data plus the new speaker's samples, as we did in the YourTTS paper.
> I recommend the original training data

Doesn't this download the original training data from the YourTTS paper?
https://github.com/coqui-ai/TTS/blob/0a9d28def1ac168540198836701fcfc9d665aa0d/recipes/vctk/yourtts/train_yourtts.py#L52-L56
> I recommend the original training data
> Doesn't this download the original training data from the YourTTS paper?

For the first experiment, yes. In Experiment 1, YourTTS was trained using only the VCTK dataset.
Also @Edresson, using
/root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth
and d_vector_file as a string, plus the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad with 1 epoch. I just wanted to check whether the model is restored correctly or not; even with 1 epoch I should not get a bad voice for already-trained voices, right? I think it's because the model tts_models--en--vctk--vits uses speaker ids like p231, p232, etc., while my config uses them as VCTK_p231, VCTK_p232, etc., because of how this code is formatted (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402 Is this the reason, or do I have to look somewhere else?
The vocoder, text encoder, and speaker embedding approach are different between YourTTS and VITS. Given that, you are losing a lot of weights, so you will need many more epochs for the model to converge. However, if you use the YourTTS checkpoint for transfer learning, you will be able to get really great results in the first epoch: "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"
Got it! So should I ignore this message (seen in the logs above)?
| > 724 / 896 layers are restored. > Model restored from step 0
This happens when I use "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth" as the restore path.
Yes. The following weights will not be loaded: speaker_encoder.* (because we changed this part and it now lives in speaker_manager), emb_l.weight (because it is not a multilingual training), and duration_predictor.cond_lang.* (because it is not a multilingual training).
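So the "724 / 896 layers are restored" line just reflects this kind of key filtering during restore. Roughly, the logic looks like the following dict-based sketch (not the real trainer code; the skip prefixes here are illustrative, taken from the reply above):

```python
def partial_restore(model_state, checkpoint_state,
                    skip_prefixes=("speaker_encoder.", "emb_l", "duration_predictor.cond_lang")):
    """Copy checkpoint tensors into the model state, skipping keys that no longer
    apply (renamed modules, multilingual-only weights). Returns (restored, total)."""
    restored = 0
    for key, value in checkpoint_state.items():
        if key not in model_state or any(key.startswith(p) for p in skip_prefixes):
            continue  # key was renamed, removed, or is intentionally excluded
        model_state[key] = value
        restored += 1
    return restored, len(model_state)

model = {"text_encoder.w": 0, "flow.w": 0, "speaker_encoder.w": 0}
ckpt = {"text_encoder.w": 1, "flow.w": 2, "speaker_encoder.w": 3, "emb_l.weight": 4}
print(partial_restore(model, ckpt))  # (2, 3) -> "2 / 3 layers are restored"
```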
After struggling for days, I found why I had an issue with my fine-tuned model: it's because the YourTTS model is multilingual, so I had to turn on
use_language_embedding=True
in order to tell my new model which language to train on.
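To illustrate, here is how the relevant model_args flags from the config above line up when fine-tuning from the multilingual YourTTS checkpoint. The values and the helper function are illustrative, not library API:

```python
# Flags from the config dump above that matter when restoring the
# multilingual YourTTS checkpoint (values here are illustrative).
model_args = {
    "use_language_embedding": True,  # the checkpoint was trained multilingual
    "embedded_language_dim": 4,      # must match the restored checkpoint
    "language_ids_file": None,       # rebuilt from each dataset's `language` field
}

def check_multilingual_restore(model_args):
    """Fail early instead of silently training with the wrong language conditioning."""
    if not model_args.get("use_language_embedding"):
        raise ValueError("Restoring a multilingual checkpoint without a language embedding")
    return "ok"

print(check_multilingual_restore(model_args))  # ok
```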
Describe the bug
Running this code with
restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth
Gives a log output of Model restored from step 0
Full log:
Also, when I run:
to test, then I get:
Somehow the inference cannot read the speaker embeddings.
Here is my config:
It might be because of a typo on line #114: https://github.com/coqui-ai/TTS/blob/9e5a469c64ca7121d3558f3ddf40b1a3e993ffcc/TTS/tts/utils/speakers.py#L110-L120 where it should be
speakers_file
instead of speaker_file?
Also, after disabling
model_args.use_d_vector_file
and enabling
model_args.use_speaker_embedding
I get this error.
Also, when restoring from /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth for 22kHz sample files I do get Model restored from step 1000000, but the rest of the inference errors are the same.
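If the speaker_file vs. speakers_file naming really is the culprit, a tolerant lookup would sidestep it while the typo is being fixed. This is a hypothetical helper over a plain config dict, not the library's API:

```python
def get_speakers_file(config):
    """Read the speakers file path, accepting either spelling of the key
    (the suspected `speaker_file` / `speakers_file` typo discussed above)."""
    for key in ("speakers_file", "speaker_file"):
        if config.get(key):
            return config[key]
    return None

# Hypothetical path, for illustration only.
print(get_speakers_file({"speaker_file": "/workspace/output/speakers.pth"}))
# /workspace/output/speakers.pth
```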
Guys @erogol @Edresson, am I doing something wrong or should I create an issue?
To Reproduce
Run train_yourtts with default params
Expected behavior
No response
Logs
No response
Environment
Additional context
No response