Dijital-Twin / model


research: Text-to-Speech Model #12

Closed t4r7k closed 6 months ago

t4r7k commented 6 months ago

We are currently exploring Text-to-Speech (TTS) models to identify an optimal solution for our project. Our primary requirements are as follows:

  1. Fine-Tuning Capability or Voice Cloning: The model must either support fine-tuning with our own datasets or be capable of generating speech that closely mimics a sample voice input. This feature is crucial for creating a personalized user experience.

  2. Model Size and Efficiency: Given our deployment constraints, the model's size and computational requirements are important considerations. We aim to strike a balance between performance and resource utilization.

  3. Response Time: To enable real-time interaction in a chatbot application, the model must generate speech with minimal latency. The goal is natural-sounding, seamless conversation without noticeable delays; a minimal benchmarking sketch follows this list.

  4. Speech Quality: High-quality, natural-sounding speech output is essential. We seek models that produce clear and lifelike audio, enhancing the overall user engagement.
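
To keep requirement 3 measurable during evaluation, here is a minimal benchmarking sketch. It assumes the candidate model is wrapped in a hypothetical callable tts_model; the helper itself is ours, not part of any library.

import time

def measure_latency(tts_model, text, runs=5):
    # Time how long the wrapped model takes to synthesize the same text,
    # keeping the best and the average over several runs.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        tts_model(text)  # hypothetical callable wrapping the model under test
        timings.append(time.perf_counter() - start)
    return min(timings), sum(timings) / len(timings)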

t4r7k commented 6 months ago

Text-to-Speech (TTS) Model Research Report

What are Text-to-Speech (TTS) Models?

Text-to-Speech models generate speech from a given text: they take text as input and return the corresponding speech audio. Here’s an example usage:

text = "Hello, this is a demo of text to speech.
tts_model(text)
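
For a concrete, runnable version of the same pattern, here is a minimal sketch using the Coqui TTS library (installable with pip install TTS); the model name is one of Coqui's published pretrained English models, chosen only for illustration:

from TTS.api import TTS

# Load a pretrained single-speaker English model (downloaded on first use).
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize the text and write the audio to a WAV file.
tts.tts_to_file(text="Hello, this is a demo of text to speech.", file_path="output.wav")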

The Main Problem of TTS

When a long text is given, the response time can be very long; this is an issue we will probably face. Another issue is that the speech can sound robotic rather than human-like. In addition, for our project we need a TTS model that is either fine-tunable or can take a sample voice input and produce a voice similar to it.
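
A common mitigation for the long-text latency problem is to split the input into sentences and synthesize them one at a time, so playback can start while later chunks are still being generated. A minimal sketch, assuming a hypothetical synthesize callable that turns one sentence into audio:

import re

def synthesize_in_chunks(text, synthesize):
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        if sentence:
            # Yield audio per sentence so playback can begin early.
            yield synthesize(sentence)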

Tested Models

The following models were tested for the project. Jupyter notebooks for the tested models can be found here.

Additionally, we tested OpenAI's TTS through the OpenAI API, but it is not included in the list above because it is neither open-source nor fine-tunable.

Sound Cleaning & Separation

To separate and obtain a clean recording of the target voice, the mpariente/DPRNNTasNet-ks2_WHAM_sepclean model is used. It is a pretrained model for separating the sources in a given audio file. A demo can be found here. To separate sounds locally, use the following code snippet from text-to-speech/preprocess/voice_isolation.ipynb:

from asteroid.models import BaseModel

# Load the pretrained speech-separation model from the Hugging Face Hub.
model = BaseModel.from_pretrained("mpariente/DPRNNTasNet-ks2_WHAM_sepclean")

# Separate the sources in the input file; the estimates are written to disk.
model.separate("voice_input.wav")
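
As far as we observed, Asteroid writes the estimated sources next to the input file (e.g. voice_input_est1.wav and voice_input_est2.wav), so the cleaner estimate can be selected by listening to the outputs.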

Sound Creation

After the sound is cleaned and separated, we can use a TTS model to generate speech from text. There are multiple models to test under text-to-speech/models. For example, speech can be generated with MetaVoice using the following command after cloning the repository:

!python fam/llm/sample.py --spk_cond_path="voice_input.mp3" --text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model by MetaVoice."
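
Here --spk_cond_path points to the reference recording whose voice should be cloned, and --text is the content to synthesize.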

Conclusion

Among the tested models, XTTS currently stands out as the most suitable option. Its ability to be customized, its superior voice naturalness compared to competitors, and its adequate response time make it the preferred choice. During implementation, if the combined latency of the TTS model and the chatbot exceeds acceptable limits, models with slightly lower sound quality but faster response times may be considered as alternatives.
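
For reference, a minimal voice-cloning sketch using XTTS through the Coqui TTS API, assuming the XTTS v2 checkpoint is the variant used and voice_input.wav is the cleaned reference recording from the separation step:

from TTS.api import TTS

# Load the multilingual XTTS model (downloaded on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice from the reference recording and synthesize the text.
tts.tts_to_file(
    text="This is a demo of text to speech.",
    speaker_wav="voice_input.wav",
    language="en",
    file_path="output.wav",
)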