coqui-ai / TTS

πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Voice conversion with VITS not working #2182

Closed: Frederieke93 closed this issue 1 year ago

Frederieke93 commented 1 year ago

Describe the bug

Hi all! I've been fine-tuning the VITS model on my own dataset, which has two speakers. After training, I wanted to do voice conversion from speaker 1 (speaker_idx) to speaker 2 (reference_speaker_idx) with a reference_wav (from speaker 2). I tried synthesizing as follows:

!python /content/drive/MyDrive/VoiceCloning/TTS/TTS/bin/synthesize.py --model_path {model_path}  \
--config_path {config_path} \
--out_path {output_path_se} \
--language_idx {language_idx} \
--speaker_idx {speaker_idx} \
--reference_wav {reference_wav} \
--reference_speaker_id {reference_speaker_idx}

However, I get the following error message: `TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not int`. It is raised in the following file:

  File "./TTS/tts/models/vits.py", line 1207, in voice_conversion
    g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)
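
For context, the failure is reproducible in plain PyTorch: torch.nn.Embedding only accepts tensor indices, so passing a raw Python speaker id fails exactly like this. A minimal sketch (the variable names and sizes are illustrative, not taken from vits.py):

import torch

# emb_g in VITS is an nn.Embedding over speaker ids; use 2 speakers here.
emb_g = torch.nn.Embedding(num_embeddings=2, embedding_dim=192)

speaker_id = 1
# emb_g(speaker_id)  # TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not int

# Wrapping the id in a LongTensor is what the embedding layer expects:
g_src = emb_g(torch.LongTensor([speaker_id])).unsqueeze(-1)
print(g_src.shape)  # torch.Size([1, 192, 1])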

To Reproduce

I did the fine-tuning with the following config, config_vits_v2.json:

{
    "output_path": "/content/drive/MyDrive/VoiceCloning/model_output/model_finetune_output_vits",
    "logger_uri": null,
    "run_name": "finetuning_vits",
    "project_name": "finetune_vits",
    "run_description": "",
    "print_step": 25,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 1000,
    "save_n_checkpoints": 5,
    "save_checkpoints": true,
    "save_all_best": true,
    "save_best_after": 10000,
    "target_loss": null,
    "print_eval": true,
    "test_delay_epochs": -1,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 100,
    "batch_size": 24,
    "eval_batch_size": 24,
    "grad_clip": [
        5.0,
        5.0
    ],
    "scheduler_after_epoch": true,
    "lr": 1e-5,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": "",
    "lr_scheduler_params": {},
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": true,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 4,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 22050,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "nl",
    "compute_input_seq_cache": false,
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": "",
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? 1234567890",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 0,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": Infinity,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 0,
    "start_by_longest": false,
    "datasets": [
        {
            "name": "speaker_1",
            "path": "/content/drive/MyDrive/VoiceCloning/datasets/speaker_1/",
            "meta_file_train": "metadata.csv",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        },
        {
            "name": "speaker_2",
            "path": "/content/drive/MyDrive/VoiceCloning/datasets/speaker_2/",
            "meta_file_train": "metadata.csv",
            "ignored_speakers": null,
            "language": "nl",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "This is a test of my voice",
            "speaker_1",
            null,
            "en"
        ],
        [
            "Dit is een test van mijn stem",
            "speaker_2",
            null,
            "nl"
        ]
    ],
    "eval_split_max_size": null,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 175,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 6,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "1",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1.0,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 0.8,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": true,
        "num_speakers": 2,
        "speakers_file": null,
        "d_vector_file": "",
        "speaker_embedding_channels": 512,
        "use_d_vector_file": false,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": true,
        "embedded_language_dim": 4,
        "num_languages": 2,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": false,
        "speaker_encoder_config_path": "",
        "speaker_encoder_model_path": "",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 1.0,
    "return_wav": true,
    "r": 1,
    "num_speakers": 2,
    "use_speaker_embedding": true,
    "speakers_file": null,
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": true,
    "use_d_vector_file": false,
    "d_vector_file": "",
    "d_vector_dim": 512
}

Expected behavior

The expected behavior was an output audio file in which the reference_wav was spoken by speaker_1.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.12.1+cu113",
        "TTS": "0.8.0",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.8.15",
        "version": "#1 SMP Fri Aug 26 08:44:51 UTC 2022"
    }
}

Additional context

Thank you for helping me out!!

manmay-nakhashi commented 1 year ago

@Frederieke93 you need to use `use_speaker_encoder_as_loss` to do VC, and use d-vector files while training.
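
In config terms, that suggestion would mean changing the relevant model_args keys from the config posted above to something like the sketch below. The paths are placeholders, and whether these settings alone are sufficient should be verified against the documentation:

    "model_args": {
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "<path/to/speaker_encoder_config.json>",
        "speaker_encoder_model_path": "<path/to/speaker_encoder_model.pth>",
        "use_d_vector_file": true,
        "d_vector_file": "<path/to/speakers.json>",
        "d_vector_dim": 512
    }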

Frederieke93 commented 1 year ago

Thank you for your response. However, I thought it would also be possible to do voice conversion with the normal VITS model that doesn't use the speaker encoder (and doesn't use a d-vector). Is that not correct?

erogol commented 1 year ago

@Edresson πŸ‘€

Edresson commented 1 year ago

> Thank you for your response. However, I thought it would also be possible to do voice conversion with the normal VITS model that doesn't use the speaker encoder (and doesn't use a d-vector). Is that not correct?

Yeah, indeed it is a bug. PR #2187 fixes this issue. Here is an example of the command to do voice conversion using the released VCTK VITS model:

tts --model_name "tts_models/en/vctk/vits" --reference_wav p226_001_mic1.flac --reference_speaker_idx "p226" --speaker_idx "p225"
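
For a fine-tuned model like the one above, the equivalent call through the Python API would look roughly like the sketch below. This assumes the Synthesizer class from TTS.utils.synthesizer (the same class synthesize.py uses); the checkpoint/config paths and speaker names are placeholders, not values confirmed by this issue:

from TTS.utils.synthesizer import Synthesizer

# Placeholders: point these at your fine-tuned checkpoint and its config.
synthesizer = Synthesizer(
    tts_checkpoint="best_model.pth",
    tts_config_path="config_vits_v2.json",
    use_cuda=True,
)

# Voice conversion: render the reference recording (speaker_2) in speaker_1's voice.
wav = synthesizer.tts(
    text="",
    speaker_name="speaker_1",
    language_name="nl",
    reference_wav="reference_speaker_2.wav",
    reference_speaker_name="speaker_2",
)
synthesizer.save_wav(wav, "converted.wav")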

Darth-Carrotpie commented 1 year ago

Similar issue/bug when using the Python API. Is it by any chance related, @Edresson? When running this code:

tts = TTS("tts_models/en/vctk/vits")
tts.tts_with_vc_to_file(
    text=text_input,
    speaker_wav="target/speaker.wav",
    file_path="output/vits.wav"
)

Receiving this error:

/usr/local/lib/python3.9/dist-packages/TTS/api.py in _check_arguments(self, speaker, language, speaker_wav, emotion, speed)
    428             # check for the coqui tts models
    429             if self.is_multi_speaker and (speaker is None and speaker_wav is None):
--> 430                 raise ValueError("Model is multi-speaker but no `speaker` is provided.")
    431             if self.is_multi_lingual and language is None:
    432                 raise ValueError("Model is multi-lingual but no `language` is provided.")

ValueError: Model is multi-speaker but no `speaker` is provided.

But after adding the `speaker` parameter:

tts = TTS("tts_models/en/vctk/vits")
tts.tts_with_vc_to_file(
    text=text_input,
    speaker=tts.speakers[7],
    speaker_wav="target/speaker.wav",
    file_path="output/vits.wav"
)

The parameter is not recognized:

TypeError                                 Traceback (most recent call last)
<ipython-input-25-5194fa66d3f0> in <cell line: 2>()
      1 tts = TTS("tts_models/en/vctk/vits")
----> 2 tts.tts_with_vc_to_file(
      3     text = text_input,
      4     speaker=tts.speakers[7],
      5     speaker_wav="target/speaker.wav",

TypeError: tts_with_vc_to_file() got an unexpected keyword argument 'speaker'

Edresson commented 1 year ago

> Similar issue/bug when using the Python API. Is it by any chance related, @Edresson? [...]

No, it is not related, and it is not a bug. `tts_with_vc_to_file()` is not designed to do voice conversion; it is designed to generate speech and then convert it to a new voice using a voice conversion model like FreeVC. It also does not support multi-speaker models (the `speaker` parameter is not implemented). If you want to do voice conversion, use the `voice_conversion_to_file()` method directly. However, it will not work with the VITS model, because VITS is a traditional multi-speaker model and cannot receive a wav reference and copy the speaker characteristics from it.
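
For anyone who wants reference-wav voice conversion through the Python API, a dedicated VC model is the intended route. Below is a minimal sketch using the released FreeVC checkpoint; the model name comes from the Coqui model zoo, and the wav paths are placeholders:

from TTS.api import TTS

# Load a dedicated voice conversion model (FreeVC), not a plain multi-speaker VITS.
tts = TTS("voice_conversion_models/multilingual/vctk/freevc24")

# Convert the speech in source_wav so it sounds like the speaker in target_wav.
tts.voice_conversion_to_file(
    source_wav="source.wav",
    target_wav="target/speaker.wav",
    file_path="output/converted.wav",
)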