Plachtaa / VALL-E-X

An open-source implementation of Microsoft's VALL-E X zero-shot TTS model. A demo is available at https://plachtaa.github.io/vallex/
MIT License

Robotic results using my voice #88

Open hdnh2006 opened 1 year ago

hdnh2006 commented 1 year ago

Hello, thanks for the amazing work you have done.

I have tried your model with my own voice and I am getting poor results. I have attached both audio files (please don't clone my voice) so you can get an idea of what is happening.

This is the code I am running:

from utils.prompt_making import make_prompt

# No transcript is passed, so the prompt audio is transcribed with Whisper
make_prompt(name="henry", audio_prompt_path="examples/henry.wav")

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# download and load all models
preload_models()

text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(text_prompt, prompt="henry")

write_wav("henry_cloned.wav", SAMPLE_RATE, audio_array)

Any advice on how to improve the results? (Attachment: henry.zip)

Plachtaa commented 1 year ago

I found the cloned voice in your attached file very nice and clear. Which aspect of the synthesized voice do you find poor and robotic?

Plachtaa commented 1 year ago

This is probably because your recorded audio is stereo while the cloned voice is mono. If the input prompt contains more than one channel, the script builds the prompt from only the first channel, which is probably why the cloned voice does not resemble yours.
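One way to rule out the channel mismatch is to downmix the recording to mono before building the prompt. The following is a minimal sketch, not part of VALL-E-X: `downmix_to_mono` is a hypothetical helper that uses only the Python standard library, assumes a 16-bit PCM WAV file, and averages the channels instead of keeping only the first one. Whether averaging actually gives a closer voice match than the first channel alone is not confirmed anywhere in this thread.

```python
import array
import sys
import wave


def downmix_to_mono(src_path, dst_path):
    """Average all channels of a 16-bit PCM WAV file into one mono channel.

    Hypothetical helper for illustration; returns the channel count of the input.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM WAV files")

    # WAV samples are little-endian and interleaved: L, R, L, R, ...
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()

    # Average each frame's channels into a single sample.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())

    return n_channels
```

If the function returns 2 for the original recording, the prompt was indeed stereo. The mono copy (for example a hypothetical `examples/henry_mono.wav`) can then be passed as `audio_prompt_path` to `make_prompt` in place of the original file.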

RahulBhalley commented 11 months ago

I wonder if VALL-E X still needs more training steps. It does not always reproduce the reference voice realistically.