EMOTTS: Multilingual Emotion-Controlled Voice Cloning Text-to-Speech System

A VITS-based TTS system that controls the emotion of the output speech through natural language and the speaker identity through reference audio.

<img src="img/emo-tts.png" style="float: left; margin-right: 0px;" />

Create Env

```bash
conda create -n emo python=3.8
conda activate emo
```

Install packages

```bash
pip install -r requirements.txt
python env.py
```

Download Pre-trained Model

Download the pre-trained model from this link, then put the files into /chinese-roberta-wwm-ext.
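Once downloaded, the checkpoint can be loaded straight from that directory with Hugging Face transformers. A minimal sketch, assuming the standard transformers API and using mean pooling as an illustrative way to turn an emotion description into a vector (the repo's own scripts define the actual usage):

```python
from transformers import BertModel, BertTokenizer

# chinese-roberta-wwm-ext ships with BERT-style tokenizer/model classes,
# so it is loaded with BertTokenizer/BertModel rather than the RoBERTa classes.
tokenizer = BertTokenizer.from_pretrained("chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("chinese-roberta-wwm-ext")

# Embed a natural-language emotion description (mean pooling is illustrative).
inputs = tokenizer("非常开心", return_tensors="pt")  # "very happy"
hidden = model(**inputs).last_hidden_state          # (1, seq_len, hidden_size)
emotion_embedding = hidden.mean(dim=1)              # (1, hidden_size)
```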

Collecting Data

Collect the data by following this.

Preprocessing

Use this code to complete the following preprocessing steps (a sketch of steps 1 and 2 follows the commands below):

  1. Convert the audio to a single channel, resample it to 22050 Hz, and save it in WAV format.
  2. Merge the audio and slice it into 10-second segments.
  3. Use ASR to transcribe the speech into text.
  4. Store the audio, emotion, and text in three folders with corresponding file names.
```bash
# Store each audio path with its corresponding text and emotion,
# then split the data into training and validation sets.
python getdata.py
python split.py
```
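Steps 1 and 2 can be reproduced with librosa and soundfile; a minimal sketch under those assumptions (getdata.py is the authoritative implementation, and the paths here are illustrative):

```python
import librosa
import soundfile as sf

# Load the raw audio as mono and resample it to 22050 Hz (illustrative path).
wav, sr = librosa.load("raw/example.mp3", sr=22050, mono=True)

# Slice the audio into 10-second segments and save each one as WAV.
seg_len = 10 * sr
for i in range(0, len(wav), seg_len):
    segment = wav[i:i + seg_len]
    sf.write(f"wavs/example_{i // seg_len:04d}.wav", segment, sr)
```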

Build Monotonic Alignment Search

```bash
cd monotonic_align
python setup.py build_ext --inplace
cd ..
```
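This compiles the Cython extension that VITS uses for monotonic alignment search (MAS): a dynamic program that, given per-frame log-likelihoods for each text token, finds the monotonic, non-skipping alignment between text tokens and latent frames with the highest total log-likelihood. A pure-NumPy sketch of the idea (the compiled extension is the fast version actually used in training):

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Best monotonic alignment for a (text_len, frame_len) matrix of
    per-frame log-likelihoods, as in Glow-TTS/VITS."""
    t_text, t_frame = log_p.shape
    neg_inf = -1e9
    Q = np.full((t_text, t_frame), neg_inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, t_frame):
        for i in range(min(j + 1, t_text)):
            stay = Q[i, j - 1]                               # token i also covers frame j-1
            advance = Q[i - 1, j - 1] if i > 0 else neg_inf  # token advanced at frame j
            Q[i, j] = log_p[i, j] + max(stay, advance)
    # Backtrack: assign each frame to exactly one text token.
    alignment = np.zeros(t_frame, dtype=np.int64)
    i = t_text - 1
    for j in range(t_frame - 1, -1, -1):
        alignment[j] = i
        if i > 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    return alignment
```

Here alignment[j] is the text token responsible for frame j; counting the occurrences of each index yields the durations used to train the duration predictor.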

Training

```bash
python train.py -c path/to/json -m model
```

Here -c is the path to the training configuration JSON and -m names the model run, i.e. the directory where checkpoints are written.

Inference

```bash
python infer.py
```

Inference conditions on the input text, a natural-language emotion description, and a reference audio clip for the target speaker; see infer.py for the exact arguments.
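In VITS-style systems, a speaker reference clip is typically consumed as a mel spectrogram by a reference encoder. A generic sketch of preparing one (the n_fft/hop_length/n_mels settings are illustrative, not necessarily this repo's exact values):

```python
import librosa
import numpy as np

# Load the reference speaker clip at the training sample rate.
ref_wav, sr = librosa.load("reference_speaker.wav", sr=22050, mono=True)

# Log-mel spectrogram as the speaker reference feature
# (n_fft/hop_length/n_mels are illustrative values).
mel = librosa.feature.melspectrogram(
    y=ref_wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024, n_mels=80
)
log_mel = np.log(np.clip(mel, 1e-5, None))
```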