coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

train_yourtts speaker embeddings does not generate audio #2236

Closed: iamkhalidbashir closed this issue 1 year ago

iamkhalidbashir commented 1 year ago

Describe the bug

Running my training code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth gives the log output "Model restored from step 0".
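For reference, this is roughly how the restore path is wired into the trainer in the standard YourTTS/VITS recipes (a minimal sketch, not my exact script; the config path is taken from the run config shown further below):

from trainer import Trainer, TrainerArgs
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits

RESTORE_PATH = (
    "/root/.local/share/tts/"
    "tts_models--multilingual--multi-dataset--your_tts/model_file.pth"
)

# Reuse the run config shown at the end of this report (path assumed).
config = VitsConfig()
config.load_json(
    "/workspace/project/output/"
    "YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/config.json"
)

train_samples, eval_samples = load_tts_samples(
    config.datasets,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)
model = Vits.init_from_config(config)

trainer = Trainer(
    TrainerArgs(restore_path=RESTORE_PATH),  # triggers the partial restore logged below
    config,
    output_path=config.output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()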

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters

Also, when I run the following to test:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

I get:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Somehow, inference cannot read the speaker embeddings.
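As a sanity check that the d_vector_file referenced in the config actually contains embeddings, it can be loaded directly (a rough sketch; the expected dict-of-{"name", "embedding"} layout is an assumption based on what compute_embeddings.py normally writes):

import torch

d_vector_file = "/workspace/project/VCTK/speakers.pth"  # d_vector_file from the config below
data = torch.load(d_vector_file, map_location="cpu")

# Expected layout: {utterance_key: {"name": speaker_name, "embedding": [...]}, ...}
print(len(data), "entries")
first_key = next(iter(data))
print(first_key, "->", data[first_key].keys() if isinstance(data[first_key], dict) else type(data[first_key]))

speakers = {v["name"] for v in data.values() if isinstance(v, dict) and "name" in v}
print(len(speakers), "unique speakers, e.g.", sorted(speakers)[:5])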

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}

Could it be a typo on line 114 of https://github.com/coqui-ai/TTS/blob/9e5a469c64ca7121d3558f3ddf40b1a3e993ffcc/TTS/tts/utils/speakers.py#L110-L120, where it should read speakers_file instead of speaker_file?
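
For illustration only (this is not the actual code from speakers.py), a config lookup keyed on the misspelled attribute silently falls back to its default, so the speakers file is never loaded even though the config defines speakers_file:

config = {"speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth"}

def get_from_config(cfg, key, default=None):
    # stand-in for the config lookup: a misspelled key silently returns the default
    return cfg.get(key, default)

print(get_from_config(config, "speaker_file"))   # -> None, so no speaker embeddings get loaded
print(get_from_config(config, "speakers_file"))  # -> the expected speakers.pth path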

Also, after disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding I get this error:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'
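
For what it's worth, here is a minimal sketch (not the library's actual fix) of the kind of guard that would avoid this KeyError: the shape comparison should only run when the checkpoint actually contains the speaker-embedding table, since checkpoints trained with external d-vectors have no emb_g.weight entry:

import torch

def has_matching_emb_g(model: torch.nn.Module, checkpoint_state: dict) -> bool:
    # True only when both the model and the checkpoint carry an `emb_g` speaker-embedding
    # table of the same shape; d-vector checkpoints contain no "emb_g.weight" key.
    if not hasattr(model, "emb_g"):
        return False
    weight = checkpoint_state.get("emb_g.weight")
    return weight is not None and weight.shape == model.emb_g.weight.shape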

Also, when restoring from /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth for 22kHz sample files I do get Model restored from step 1000000, but the rest of the inference errors are the same.

Guys @erogol @Edresson, am I doing something wrong, or should I create an issue?

To Reproduce

Run train_yourtts with default params

Expected behavior

No response

Logs

No response

Environment

Nvidia 3090

Additional context

No response

iamkhalidbashir commented 1 year ago

Also @Edresson, using /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth, with d_vector_file as a string and the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad after 1 epoch. I just wanted to check whether the model is restored correctly: even after only 1 epoch I should not get a bad voice for already-trained speakers, right? I think it's because the model tts_models--en--vctk--vits uses speaker ids like p231, p232, etc., while my config uses VCTK_p231, VCTK_p232, etc.

That is because of how the speaker name is formatted in this code (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402

Is this the reason, or do I have to look somewhere else?
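
If it is just the VCTK_ prefix, one quick check would be to remap the keys of the embeddings file before pointing the model at it. This is only a sketch and assumes speakers.pth is a flat dict keyed by speaker name (the real layout may differ); the output path is hypothetical:

import torch

src = "/workspace/project/VCTK/speakers.pth"             # d_vector_file path from the config above
dst = "/workspace/project/VCTK/speakers_unprefixed.pth"  # hypothetical output path, not from the original setup

mapping = torch.load(src)
# Assumes a flat dict keyed by speaker name; strip the formatter's "VCTK_" prefix.
remapped = {key.replace("VCTK_", "", 1): value for key, value in mapping.items()}
torch.save(remapped, dst)
print(sorted(remapped)[:5])  # expect keys like "p231", "p232", ... after remapping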

Edresson commented 1 year ago

Also @Edresson, using /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth, with d_vector_file as a string and the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad after 1 epoch. I just wanted to check whether the model is restored correctly: even after only 1 epoch I should not get a bad voice for already-trained speakers, right? I think it's because the model tts_models--en--vctk--vits uses speaker ids like p231, p232, etc., while my config uses VCTK_p231, VCTK_p232, etc.

That is because of how the speaker name is formatted in this code (line 399):

https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402

Is this the reason, or do I have to look somewhere else?

The vocoder, text encoder, and speaker embedding approach are different between YourTTS and VITS. Given that you lose a lot of the weights, you will need many more epochs for the model to converge.

However, if you use the YourTTS checkpoint (/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth) for transfer learning, you will be able to get really great results in the first epoch.
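
For reference, a minimal sketch of pointing the recipe at that checkpoint, assuming it builds its Trainer from TrainerArgs the way the standard recipes do; the remaining Trainer arguments (config, model, samples) come from the recipe itself:

from trainer import TrainerArgs

RESTORE_PATH = "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"
trainer_args = TrainerArgs(restore_path=RESTORE_PATH)
# trainer_args is then handed to Trainer(...) together with the recipe's config, model and
# samples, so training restores the released YourTTS weights instead of starting at step 0.
print(trainer_args.restore_path)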

Edresson commented 1 year ago

Describe the bug

Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth Gives a log output of Model restored from step 0

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters

Also When I run:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

to test then I get

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Somehow the inference cannot read the speaker embeddings

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}

It might be because of a typo on line #114:

https://github.com/coqui-ai/TTS/blob/9e5a469c64ca7121d3558f3ddf40b1a3e993ffcc/TTS/tts/utils/speakers.py#L110-L120

Shouldn't it be speakers_file instead of speaker_file? Also, after disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding, I get this error:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'
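Presumably a guard of the following shape would avoid that KeyError when the checkpoint has no emb_g.weight at all (this is only an illustration of the kind of check involved, not a proposed patch to vits.py):

# Illustration only: verify that the checkpoint really contains "emb_g.weight"
# before comparing shapes, so d-vector-based checkpoints do not raise KeyError.
def emb_g_shape_mismatch(model, state):
    return (
        hasattr(model, "emb_g")
        and "emb_g.weight" in state["model"]
        and state["model"]["emb_g.weight"].shape != model.emb_g.weight.shape
    )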

Also, when restoring from /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth for 22kHz sample files I do get Model restored from step 1000000, but the rest of the inference errors are the same.

Guys @erogol @Edresson, am I doing something wrong, or should I create an issue?

To Reproduce

Run train_yourtts with default params

Expected behavior

No response

Logs

No response

Environment

Nvidia 3090

Additional context

No response

The issue is the type of d_vector_file inside VitsArgs and VitsConfig. It should be "List[str]" and not "str". We are working on the proper fix in #2234.
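In other words, the field annotation changes roughly like this (a paraphrased sketch of the type change, not the actual diff from #2234):

from dataclasses import dataclass, field
from typing import List

@dataclass
class VitsArgsSketch:
    # Before (roughly): d_vector_file: str = None
    # After: a list of d-vector files, so several speakers.pth files can be passed at once.
    d_vector_file: List[str] = field(default_factory=list)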

iamkhalidbashir commented 1 year ago

Also @Edresson, using /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth, d_vector_file as a string, and the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad with 1 epoch. I just wanted to check whether the model is restored correctly; even with 1 epoch I should not get a bad voice for already-trained speakers, right? I think it's because the model tts_models--en--vctk--vits uses speaker ids like p231, p232, etc., while my config uses VCTK_p231, VCTK_p232, etc., because of how the formatter builds the speaker name (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402

Is this the reason, or do I have to look somewhere else?
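For context, this is roughly what the linked formatter lines do to the speaker name (paraphrased from the behaviour I am seeing, not a verbatim copy of formatters.py):

# Paraphrased: the VCTK formatter prefixes every speaker id, so "p231" in the
# dataset folder becomes "VCTK_p231" in the speaker manager, while
# tts_models--en--vctk--vits was trained with the bare ids.
def vctk_speaker_name(speaker_id: str) -> str:
    return "VCTK_" + speaker_id

print(vctk_speaker_name("p231"))  # -> VCTK_p231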

The vocoder, text encoder, and speaker embedding approach are different between YourTTS and VITS. Given that you are losing a lot of weights, you will need a lot more epochs for the model to converge.

However, if you use the YourTTS checkpoint to do transfer learning, you will be able to get really great results in the first epoch: "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"

Got it! So should I ignore the message (as mentioned in the logs above):

 | > 724 / 896 layers are restored.
 > Model restored from step 0

This happens when I use "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth" as the restore path.
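For reference, pointing the trainer at that checkpoint is just a matter of the restore path (a minimal sketch assuming the recipe's TrainerArgs setup; only the path comes from this thread):

from trainer import TrainerArgs

# Minimal sketch: restore the multilingual YourTTS checkpoint for transfer learning.
# Layers that do not match the new model definition are skipped, which is why the
# log reports "724 / 896 layers are restored" and "Model restored from step 0".
RESTORE_PATH = "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"
trainer_args = TrainerArgs(restore_path=RESTORE_PATH)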

Edresson commented 1 year ago

Also @Edresson, using /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth, d_vector_file as a string, and the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad with 1 epoch. I just wanted to check whether the model is restored correctly; even with 1 epoch I should not get a bad voice for already-trained speakers, right? I think it's because the model tts_models--en--vctk--vits uses speaker ids like p231, p232, etc., while my config uses VCTK_p231, VCTK_p232, etc., because of how the formatter builds the speaker name (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402

Is this the reason, or do I have to look somewhere else?

The vocoder, text encoder, and speaker embedding approach are different between YourTTS and VITS. Given that you are losing a lot of weights, you will need a lot more epochs for the model to converge. However, if you use the YourTTS checkpoint to do transfer learning, you will be able to get really great results in the first epoch: "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"

Got it! So should I ignore the message (as mentioned in the logs above):

 | > 724 / 896 layers are restored.
 > Model restored from step 0

This happens when I use "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth" as the restore path.

Yes, the following weights will not be loaded: "speaker_encoder." (because we changed this part and it now lives in the speaker_manager), "emb_l.weight" (because it is not multilingual training), and "duration_predictor.cond_lang." (because it is not multilingual training).
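Roughly, the partial restore just filters the checkpoint like this (an illustration of the idea, not the trainer's actual restore code):

# Illustration: checkpoint entries with no counterpart in the current model
# definition (speaker_encoder.*, emb_l.weight, duration_predictor.cond_lang.*)
# are dropped before load_state_dict is called; everything else is restored.
def filter_restorable_keys(checkpoint_state: dict, model_state: dict) -> dict:
    return {k: v for k, v in checkpoint_state.items() if k in model_state}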

iamkhalidbashir commented 1 year ago

Thanks @Edresson. Just a quick concern I wanted your input on: my speakers.pth has the following when --list_speaker_idxs is used:

{'VCTK_p225': 0, 'VCTK_p226': 1, 'VCTK_p227': 2, 'VCTK_p228': 3, 'VCTK_p229': 4, 'VCTK_p230': 5, 'VCTK_p231': 6, 'VCTK_p232': 7, 'VCTK_p233': 8, 'VCTK_p234': 9, 'VCTK_p236': 10, 'VCTK_p237': 11, 'VCTK_p238': 12, 'VCTK_p239': 13, 'VCTK_p240': 14, 'VCTK_p241': 15, 'VCTK_p243': 16, 'VCTK_p244': 17, 'VCTK_p245': 18, 'VCTK_p246': 19, 'VCTK_p247': 20, 'VCTK_p248': 21, 'VCTK_p249': 22, 'VCTK_p250': 23, 'VCTK_p251': 24, 'VCTK_p252': 25, 'VCTK_p253': 26, 'VCTK_p254': 27, 'VCTK_p255': 28, 'VCTK_p256': 29, 'VCTK_p257': 30, 'VCTK_p258': 31, 'VCTK_p259': 32, 'VCTK_p260': 33, 'VCTK_p261': 34, 'VCTK_p262': 35, 'VCTK_p263': 36, 'VCTK_p264': 37, 'VCTK_p265': 38, 'VCTK_p266': 39, 'VCTK_p267': 40, 'VCTK_p268': 41, 'VCTK_p269': 42, 'VCTK_p270': 43, 'VCTK_p271': 44, 'VCTK_p272': 45, 'VCTK_p273': 46, 'VCTK_p274': 47, 'VCTK_p275': 48, 'VCTK_p276': 49, 'VCTK_p277': 50, 'VCTK_p278': 51, 'VCTK_p279': 52, 'VCTK_p280': 53, 'VCTK_p281': 54, 'VCTK_p282': 55, 'VCTK_p283': 56, 'VCTK_p284': 57, 'VCTK_p285': 58, 'VCTK_p286': 59, 'VCTK_p287': 60, 'VCTK_p288': 61, 'VCTK_p292': 62, 'VCTK_p293': 63, 'VCTK_p294': 64, 'VCTK_p295': 65, 'VCTK_p297': 66, 'VCTK_p298': 67, 'VCTK_p299': 68, 'VCTK_p300': 69, 'VCTK_p301': 70, 'VCTK_p302': 71, 'VCTK_p303': 72, 'VCTK_p304': 73, 'VCTK_p305': 74, 'VCTK_p306': 75, 'VCTK_p307': 76, 'VCTK_p308': 77, 'VCTK_p310': 78, 'VCTK_p311': 79, 'VCTK_p312': 80, 'VCTK_p313': 81, 'VCTK_p314': 82, 'VCTK_p316': 83, 'VCTK_p317': 84, 'VCTK_p318': 85, 'VCTK_p323': 86, 'VCTK_p326': 87, 'VCTK_p329': 88, 'VCTK_p330': 89, 'VCTK_p333': 90, 'VCTK_p334': 91, 'VCTK_p335': 92, 'VCTK_p336': 93, 'VCTK_p339': 94, 'VCTK_p340': 95, 'VCTK_p341': 96, 'VCTK_p343': 97, 'VCTK_p345': 98, 'VCTK_p347': 99, 'VCTK_p351': 100, 'VCTK_p360': 101, 'VCTK_p361': 102, 'VCTK_p363': 103, 'VCTK_p364': 104, 'VCTK_p374': 105, 'VCTK_p376': 106, 'VCTK_s5': 107, 'VCTK_old_new_voice': 0}

As you can see, the voices VCTK_p225 and VCTK_old_new_voice (my new voice, loaded with the vctk_old formatter) both have id 0 after I add my new voice to the DATASETS_CONFIG_LIST. Is this a problem?

Also, it looks like my new voice (VCTK_old_new_voice) and all the VCTK voices are getting better even after 4k steps, but they sound non-English (another language, maybe?): https://voca.ro/1lwuJZycpSl7, although I have set everything to en.

Edresson commented 1 year ago

Thanks @Edresson. Just a quick concern I wanted your input on: my speakers.pth has the following when --list_speaker_idxs is used:

{'VCTK_p225': 0, 'VCTK_p226': 1, 'VCTK_p227': 2, 'VCTK_p228': 3, 'VCTK_p229': 4, 'VCTK_p230': 5, 'VCTK_p231': 6, 'VCTK_p232': 7, 'VCTK_p233': 8, 'VCTK_p234': 9, 'VCTK_p236': 10, 'VCTK_p237': 11, 'VCTK_p238': 12, 'VCTK_p239': 13, 'VCTK_p240': 14, 'VCTK_p241': 15, 'VCTK_p243': 16, 'VCTK_p244': 17, 'VCTK_p245': 18, 'VCTK_p246': 19, 'VCTK_p247': 20, 'VCTK_p248': 21, 'VCTK_p249': 22, 'VCTK_p250': 23, 'VCTK_p251': 24, 'VCTK_p252': 25, 'VCTK_p253': 26, 'VCTK_p254': 27, 'VCTK_p255': 28, 'VCTK_p256': 29, 'VCTK_p257': 30, 'VCTK_p258': 31, 'VCTK_p259': 32, 'VCTK_p260': 33, 'VCTK_p261': 34, 'VCTK_p262': 35, 'VCTK_p263': 36, 'VCTK_p264': 37, 'VCTK_p265': 38, 'VCTK_p266': 39, 'VCTK_p267': 40, 'VCTK_p268': 41, 'VCTK_p269': 42, 'VCTK_p270': 43, 'VCTK_p271': 44, 'VCTK_p272': 45, 'VCTK_p273': 46, 'VCTK_p274': 47, 'VCTK_p275': 48, 'VCTK_p276': 49, 'VCTK_p277': 50, 'VCTK_p278': 51, 'VCTK_p279': 52, 'VCTK_p280': 53, 'VCTK_p281': 54, 'VCTK_p282': 55, 'VCTK_p283': 56, 'VCTK_p284': 57, 'VCTK_p285': 58, 'VCTK_p286': 59, 'VCTK_p287': 60, 'VCTK_p288': 61, 'VCTK_p292': 62, 'VCTK_p293': 63, 'VCTK_p294': 64, 'VCTK_p295': 65, 'VCTK_p297': 66, 'VCTK_p298': 67, 'VCTK_p299': 68, 'VCTK_p300': 69, 'VCTK_p301': 70, 'VCTK_p302': 71, 'VCTK_p303': 72, 'VCTK_p304': 73, 'VCTK_p305': 74, 'VCTK_p306': 75, 'VCTK_p307': 76, 'VCTK_p308': 77, 'VCTK_p310': 78, 'VCTK_p311': 79, 'VCTK_p312': 80, 'VCTK_p313': 81, 'VCTK_p314': 82, 'VCTK_p316': 83, 'VCTK_p317': 84, 'VCTK_p318': 85, 'VCTK_p323': 86, 'VCTK_p326': 87, 'VCTK_p329': 88, 'VCTK_p330': 89, 'VCTK_p333': 90, 'VCTK_p334': 91, 'VCTK_p335': 92, 'VCTK_p336': 93, 'VCTK_p339': 94, 'VCTK_p340': 95, 'VCTK_p341': 96, 'VCTK_p343': 97, 'VCTK_p345': 98, 'VCTK_p347': 99, 'VCTK_p351': 100, 'VCTK_p360': 101, 'VCTK_p361': 102, 'VCTK_p363': 103, 'VCTK_p364': 104, 'VCTK_p374': 105, 'VCTK_p376': 106, 'VCTK_s5': 107, 'VCTK_old_new_voice': 0}

As you can see, the voices VCTK_p225 and VCTK_old_new_voice (my new voice, loaded with the vctk_old formatter) both have id 0 after I add my new voice to the DATASETS_CONFIG_LIST. Is this a problem?

Also, it looks like my new voice (VCTK_old_new_voice) and all the VCTK voices are getting better even after 4k steps, but they sound non-English (another language, maybe?): https://voca.ro/1lwuJZycpSl7, although I have set everything to en.

It should not affect the training or inference, because we use the speaker name and not the ids. But yeah, it is weird and can cause confusion; I fixed it in https://github.com/coqui-ai/TTS/pull/2234/commits/c8245cde075911a2137d7963feff2abbf48d5d07
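A trivial illustration of why the shared id is harmless (plain Python, not the TTS API):

# Two distinct names can map to the same integer id without colliding,
# because lookups are keyed by the speaker name, not by the id value.
name_to_id = {"VCTK_p225": 0, "VCTK_old_new_voice": 0}
assert "VCTK_p225" in name_to_id and "VCTK_old_new_voice" in name_to_id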

iamkhalidbashir commented 1 year ago

Amazing! Quick questions: Q1: It looks like my new voice (VCTK_old_new_voice) and all the VCTK voices are getting better even after 4k steps, but they sound non-English (another language, maybe?): https://voca.ro/1lwuJZycpSl7, although I have set everything to en.

Q2: To fine-tune the new voice, do I really need the VCTK dataset, or is the restored model path enough? And how many steps should I expect for good-quality fine-tuning of my voice?

Thanks

iamkhalidbashir commented 1 year ago

For example, this (https://voca.ro/168hwUXVT4wy) sounds like someone speaking in another language, although I have set en as the language for my dataset.

My config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 10,
    "batch_size": 24,
    "eval_batch_size": 24,
    "grad_clip": [
        1000.0,
        1000.0
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 160000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": [
                "p261",
                "p225",
                "p294",
                "p347",
                "p238",
                "p234",
                "p248",
                "p335",
                "p245",
                "p326",
                "p302"
            ],
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        },
        {
            "formatter": "vctk_old",
            "dataset_name": "newspeaker",
            "path": "/workspace/project/datasets/madaliene",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_old_new_voice",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_old_new_voice",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 62,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1.0,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-23-2022_10+35AM-c8245cde/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth",
            "/workspace/project/datasets/madaliene/speakers.pth"
        ],
        "speaker_embedding_channels": 512,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": false,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": {
        "speaker_name": 1.0
    },
    "weighted_sampler_multipliers": {
        "speaker_name": {}
    },
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-23-2022_10+35AM-c8245cde/speakers.pth",
    "speaker_embedding_channels": 512,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth",
        "/workspace/project/datasets/madaliene/speakers.pth"
    ],
    "d_vector_dim": 512
}
iamkhalidbashir commented 1 year ago

Also @Edresson @erogol, I think there is a bug/misconfiguration in these lines as well: https://github.com/coqui-ai/TTS/blob/0910cb76bcd85df56bf43654bb31427647cdfd0d/recipes/vctk/yourtts/train_yourtts.py#L206-L209

Error:

 > EPOCH: 0/10
 --> /content/project/output/YourTTS-EN-VCTK-December-26-2022_08+20AM-c8245cde

> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: False
| > Number of instances : 39959
 ! Run is kept in /content/project/output/YourTTS-EN-VCTK-December-26-2022_08+20AM-c8245cde
 | > Preprocessing samples
 | > Max text length: 388
 | > Min text length: 10
 | > Avg text length: 40.86955977676002
 | 
 | > Max audio length: 159781.0
 | > Min audio length: 7698.0
 | > Avg audio length: 24937.56375603774
 | > Num. instances discarded samples: 2
 | > Batch group size: 50.
 > Using weighted sampler for attribute 'speaker_name' with alpha '1.0'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1591, in fit
    self._fit()
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1544, in _fit
    self.train_epoch()
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1292, in train_epoch
    self.train_loader = self.get_train_dataloader(
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 803, in get_train_dataloader
    return self._get_loader(
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 767, in _get_loader
    loader = model.get_data_loader(
  File "/content/project/TTS/TTS/tts/models/vits.py", line 1621, in get_data_loader
    sampler = self.get_sampler(config, dataset, num_gpus)
  File "/content/project/TTS/TTS/tts/models/vits.py", line 1554, in get_sampler
    multi_dict = config.weighted_sampler_multipliers.get(attr_name, None)
AttributeError: 'NoneType' object has no attribute 'get'
An exception has occurred, use %tb to see the full traceback.

SystemExit: 1
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3334: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

I think the training is failing because weighted_sampler_multipliers must be supplied!

Edresson commented 1 year ago

weighted_sampler_multipliers

On my side the training works well. weighted_sampler_multipliers defaults to an empty dict, so it should work.
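If it still comes through as null in a saved config (as in the first config posted above), a guard of this shape avoids the crash (a workaround sketch, not the merged fix):

# Workaround sketch: treat a None weighted_sampler_multipliers as an empty dict
# before the .get() call that raises in get_sampler.
weighted_sampler_multipliers = None  # the value that triggered the AttributeError
multi_dict = (weighted_sampler_multipliers or {}).get("speaker_name", None)
print(multi_dict)  # None, instead of raising

Explicitly setting "weighted_sampler_multipliers": {"speaker_name": {}} in the config, as in the later config posted above, has the same effect.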

iamkhalidbashir commented 1 year ago

Amazing! Quick questions: Q1: It looks like my new voice (VCTK_old_new_voice) and all the VCTK voices are getting better even after 4k steps, but they sound non-English (another language, maybe?): https://voca.ro/1lwuJZycpSl7, although I have set everything to en.

Q2: To fine-tune the new voice, do I really need the VCTK dataset, or is the restored model path enough? And how many steps should I expect for good-quality fine-tuning of my voice?

Thanks

Could you please take a look at this?

Edresson commented 1 year ago

Q1: Not sure what is happening. I fine-tuned the model with the recipe for 3 epochs and the voices sound great. Try running the recipe as it is in PR #2234 (without your dataset).

Q2: It depends on how much data you have. I recommend the original training data + the new speaker samples, as we did in the YourTTS paper.

iamkhalidbashir commented 1 year ago

I recommend the original training data

Doesn't this download the original training data from the YourTTS paper? https://github.com/coqui-ai/TTS/blob/0a9d28def1ac168540198836701fcfc9d665aa0d/recipes/vctk/yourtts/train_yourtts.py#L52-L56

Edresson commented 1 year ago

I recommend the original training data

Doesn't this download the original training data from the YourTTS paper?

https://github.com/coqui-ai/TTS/blob/0a9d28def1ac168540198836701fcfc9d665aa0d/recipes/vctk/yourtts/train_yourtts.py#L52-L56

For the first experiment, yes. In Experiment 1, YourTTS was trained using only the VCTK dataset.

iamkhalidbashir commented 1 year ago

Also @Edresson, using /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth, d_vector_file as a string, and the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad with 1 epoch. I just wanted to check whether the model is restored correctly; even with 1 epoch I should not get a bad voice for already-trained speakers, right? I think it's because the model tts_models--en--vctk--vits uses speaker ids like p231, p232, etc., while my config uses VCTK_p231, VCTK_p232, etc., because of how the formatter builds the speaker name (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402

Is this the reason, or do I have to look somewhere else?

The vocoder, text encoder, and speaker embedding approach are different between YourTTS and VITS. Given that you are losing a lot of weights, you will need a lot more epochs for the model to converge. However, if you use the YourTTS checkpoint to do transfer learning, you will be able to get really great results in the first epoch: "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"

Got it! So should I ignore the message (as mentioned in the logs above):

 | > 724 / 896 layers are restored.
 > Model restored from step 0

This happens when I use "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth" as the restore path.

Yes, the following weights will not be loaded: "speaker_encoder." (because we changed this part and it now lives in the speaker_manager), "emb_l.weight" (because it is not multilingual training), and "duration_predictor.cond_lang." (because it is not multilingual training).

After struggling for days, I found out why I had an issue with my fine-tuned model: it's because the YourTTS model is multilingual, so I had to turn on use_language_embedding=True in order to tell my new model which language to train on.
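For anyone landing here, the relevant model_args end up looking roughly like this (a sketch assembled from the configs posted above; the path is the one from this thread, not a default):

from TTS.tts.models.vits import VitsArgs

# Sketch of the flags that mattered in this thread: keep the d-vector setup and,
# because the restored YourTTS checkpoint is multilingual, enable the language
# embedding so the model knows which language it is being fine-tuned on.
model_args = VitsArgs(
    d_vector_file=["/workspace/project/VCTK/speakers.pth"],  # path from this thread
    use_d_vector_file=True,
    d_vector_dim=512,
    use_language_embedding=True,  # the fix described above
)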