Closed — CHIHHSIANGLI closed this issue 10 months ago
This is not a bug. Cloning sometimes just doesn't work well on a given sample. Try better and cleaner audio samples for better cloning results.
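Following the advice above, part of "cleaner audio" is trimming the silence around the reference clip and normalizing its level before passing it to the model. A minimal sketch of that cleanup, assuming you already have the clip as a mono float waveform in a NumPy array (the threshold value here is an arbitrary choice, not anything from the TTS library):

```python
import numpy as np

def trim_and_normalize(wav: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Trim leading/trailing silence and peak-normalize a mono waveform.

    `wav` is a 1-D float array with samples in [-1, 1]; `threshold` is the
    amplitude below which samples count as silence (an arbitrary choice).
    """
    loud = np.nonzero(np.abs(wav) > threshold)[0]
    if loud.size == 0:
        return wav  # entirely silent: nothing to trim
    trimmed = wav[loud[0] : loud[-1] + 1]
    peak = np.abs(trimmed).max()
    return trimmed / peak if peak > 0 else trimmed
```

You can load and save the waveform with `torchaudio.load` / `torchaudio.save` (already used in the repro script below) and convert to NumPy with `.numpy()` before cleaning.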
I also have this issue: the output is always a female voice, the reference audio is ignored, and there is also noise. tts = TTS("tts_models/zh-CN/baker/tacotron2-DDC-GST")
Same here. There is only one female voice when using tts_models/zh-CN/baker/tacotron2-DDC-GST; it doesn't clone other voices.
Describe the bug
I've tried the latest XTTS v1.1 model to clone a voice. It only takes one sample, and the output voice doesn't sound like the target voice. I've also tried fine-tuning XTTS v1.1 on half an hour of audio to make it sound more like the target speaker; however, it still needs a reference voice, and even when I supply one, the output audio still doesn't sound like the target speaker. Does anyone know how to make a good clone?
To Reproduce
```python
import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Add here the xtts_config path
CONFIG_PATH = "run/training/GPT_XTTS_TRUMP-October-26-2023_10+45PM-edd3a287/config.json"

# Add here the vocab file that you have used to train the model
TOKENIZER_PATH = "run/training/XTTS_v1.1_original_model_files/vocab.json"

# Add here the checkpoint that you want to do inference with
XTTS_CHECKPOINT = "run/training/GPT_XTTS_TRUMP-October-26-2023_10+45PM-edd3a287/best_model.pth"

# Add here the speaker reference
SPEAKER_REFERENCE = "trump1.wav"

# Output wav path
OUTPUT_WAV_PATH = "xtts-ft.wav"

print("Loading model...")
config = XttsConfig()
config.load_json(CONFIG_PATH)
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path=XTTS_CHECKPOINT, vocab_path=TOKENIZER_PATH, use_deepspeed=False)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, diffusion_conditioning, speaker_embedding = model.get_conditioning_latents(audio_path=SPEAKER_REFERENCE)

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    diffusion_conditioning,
    temperature=0.7,  # Add custom parameters here
)
torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
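One thing sometimes tried when a single reference clip gives a poor clone is to compute conditioning latents for several clips of the same speaker and combine the speaker embeddings. The sketch below is hypothetical plain-NumPy glue, not part of the TTS API — you would call `get_conditioning_latents` once per clip yourself and convert each returned `speaker_embedding` to a NumPy vector first (check your installed version's exact return types):

```python
import numpy as np

def average_speaker_embeddings(embeddings):
    """Combine per-clip speaker-embedding vectors into one.

    `embeddings` is a list of equal-length 1-D float arrays, e.g. the
    speaker embedding obtained for each reference clip (hypothetical
    usage; verify against your TTS version). Each vector is
    unit-normalized before averaging so no single clip dominates, and
    the mean is renormalized to unit length.
    """
    stacked = np.stack([e / np.linalg.norm(e) for e in embeddings])
    mean = stacked.mean(axis=0)
    return mean / np.linalg.norm(mean)
```

Whether this helps depends on how consistent the reference clips are; wildly different recording conditions can average out to a muddy speaker identity.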
Expected behavior
No response
Logs
No response
Environment
Additional context
No response