coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Can't clone a voice well #3110

Closed CHIHHSIANGLI closed 10 months ago

CHIHHSIANGLI commented 10 months ago

Describe the bug

I've tried the latest XTTS v1.1 model to clone a voice. It only takes one reference sample, and the output doesn't sound like the target voice. I've also tried fine-tuning XTTS v1.1 on half an hour of audio to make it sound more like the target speaker, but inference still requires a reference clip, and even when I supply one the output audio still doesn't sound like the target speaker. Does anyone know how to make a good clone?

To Reproduce

```python
import os

import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Add here the xtts_config path
CONFIG_PATH = "run/training/GPT_XTTS_TRUMP-October-26-2023_10+45PM-edd3a287/config.json"

# Add here the vocab file that you have used to train the model
TOKENIZER_PATH = "run/training/XTTS_v1.1_original_model_files/vocab.json"

# Add here the checkpoint that you want to do inference with
XTTS_CHECKPOINT = "run/training/GPT_XTTS_TRUMP-October-26-2023_10+45PM-edd3a287/best_model.pth"

# Add here the speaker reference
SPEAKER_REFERENCE = "trump1.wav"

# Output wav path
OUTPUT_WAV_PATH = "xtts-ft.wav"

print("Loading model...")
config = XttsConfig()
config.load_json(CONFIG_PATH)
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=XTTS_CHECKPOINT,
    vocab_path=TOKENIZER_PATH,
    use_deepspeed=False,
)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, diffusion_conditioning, speaker_embedding = model.get_conditioning_latents(
    audio_path=SPEAKER_REFERENCE
)

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    diffusion_conditioning,
    temperature=0.7,  # Add custom parameters here
)
torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "TTS": "0.19.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.18",
        "version": "#1 SMP Fri Jan 27 02:56:13 UTC 2023"
    }
}

Additional context

No response

erogol commented 10 months ago

This is not a bug. Sometimes it doesn't work well. You should try better and cleaner audio samples for better cloning.

Nanshanelectrician commented 9 months ago

I also have this issue: the voice is always a female voice, the reference audio is ignored, and there is also noise. `tts = TTS("tts_models/zh-CN/baker/tacotron2-DDC-GST")`

dididiskq commented 2 months ago

> I also have this issue: the voice is always a female voice, the reference audio is ignored, and there is also noise. `tts = TTS("tts_models/zh-CN/baker/tacotron2-DDC-GST")`

Same for me. There is only one female voice when using `tts_models/zh-CN/baker/tacotron2-DDC-GST`; it doesn't clone other voices.
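For context, `tacotron2-DDC-GST` is a single-speaker Chinese model, so a reference clip cannot change its voice; a cloning-capable model such as XTTS is needed instead. A hedged sketch via the high-level API (the exact model name and the file paths are assumptions about the current release, and the model weights are downloaded on first use):

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# XTTS v2 supports zero-shot voice cloning; tacotron2-DDC-GST does not.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(
    text="你好，世界。",          # "Hello, world."
    speaker_wav="reference.wav",  # hypothetical path to a clean reference clip
    language="zh-cn",
    file_path="cloned.wav",
)
```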