coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Can't clone a voice well #3110

Closed CHIHHSIANGLI closed 10 months ago

CHIHHSIANGLI commented 10 months ago

Describe the bug

I've tried the latest XTTS v1.1 model to clone a voice. It only takes one reference sample, and the output doesn't sound like the target voice. I've also tried fine-tuning XTTS v1.1 on half an hour of audio to make it sound more like the target speaker, but inference still requires a reference clip, and even when I supply one the output audio still doesn't sound like the target speaker. Does anyone know how to make a good clone?

To Reproduce

```python
import os

import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Add here the xtts_config path
CONFIG_PATH = "run/training/GPT_XTTS_TRUMP-October-26-2023_10+45PM-edd3a287/config.json"

# Add here the vocab file that you have used to train the model
TOKENIZER_PATH = "run/training/XTTS_v1.1_original_model_files/vocab.json"

# Add here the checkpoint that you want to do inference with
XTTS_CHECKPOINT = "run/training/GPT_XTTS_TRUMP-October-26-2023_10+45PM-edd3a287/best_model.pth"

# Add here the speaker reference
SPEAKER_REFERENCE = "trump1.wav"

# Output wav path
OUTPUT_WAV_PATH = "xtts-ft.wav"

print("Loading model...")
config = XttsConfig()
config.load_json(CONFIG_PATH)
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=XTTS_CHECKPOINT,
    vocab_path=TOKENIZER_PATH,
    use_deepspeed=False,
)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, diffusion_conditioning, speaker_embedding = model.get_conditioning_latents(
    audio_path=SPEAKER_REFERENCE
)

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    diffusion_conditioning,
    temperature=0.7,  # Add custom parameters here
)
torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "TTS": "0.19.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.18",
        "version": "#1 SMP Fri Jan 27 02:56:13 UTC 2023"
    }
}

Additional context

No response

erogol commented 10 months ago

This is not a bug. Sometimes it doesn't work well. You should try better and cleaner audio samples for better cloning.

Nanshanelectrician commented 9 months ago

I also have this issue: the voice is always a female voice, the reference audio is ignored, and there is also noise. `tts = TTS("tts_models/zh-CN/baker/tacotron2-DDC-GST")`

dididiskq commented 2 months ago

> I also have this issue: the voice is always a female voice, the reference audio is ignored, and there is also noise. `tts = TTS("tts_models/zh-CN/baker/tacotron2-DDC-GST")`

Same for me. There is only one female voice when using `tts_models/zh-CN/baker/tacotron2-DDC-GST`; it doesn't clone other voices.
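For context, `tacotron2-DDC-GST` is a single-speaker Chinese model, so a reference clip cannot change its voice; a cloning-capable model such as XTTS is needed instead. A hedged sketch via the high-level API (the exact model name and the file paths are assumptions about the current release, and the model weights are downloaded on first use):

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# XTTS v2 supports zero-shot voice cloning; tacotron2-DDC-GST does not.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(
    text="你好，世界。",          # "Hello, world."
    speaker_wav="reference.wav",  # hypothetical path to a clean reference clip
    language="zh-cn",
    file_path="cloned.wav",
)
```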