We started from this FastSpeech2 implementation: https://github.com/ming024/FastSpeech2
However, we have made several changes, so the code is not identical.
For example:
The Russian dataset was borrowed from https://github.com/vlomme/Multi-Tacotron-Voice-Cloning. We did not use all the speakers; we filtered them by utterance length and recording quality, which left 65 speakers in the end. You can check the samples in 'examples'.
MFA was trained from scratch after preprocessing the text with russian_g2p. Using MFA may not be straightforward, so we refer to this manual: https://github.com/ivanvovk/DurIAN#6-how-to-align-your-own-data
We use russian_g2p, so you will need to install it first:

git clone https://github.com/nsu-ai/russian_g2p.git
cd russian_g2p
pip3 install -r requirements.txt
pip install .
Then install this repository's requirements: pip install -r requirements.txt
Download weights: https://drive.google.com/drive/folders/1dX7ELe9C9-ja_liYrgph3Uu5Z5EMljjh?usp=sharing
Move the HiFi-GAN and FastSpeech2 weights into 'pretrained' and check that the paths in the config match:

tts.weights_path - path to the pretrained FastSpeech2 model;
speakers_json - speaker names; put it in the same folder as the model weights (it is already there for the pretrained model);
stats_json - raw-data pitch and energy statistics; also goes in the same folder as the model weights;
hifi.weights_path - path to the pretrained HiFi-GAN.
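Putting the points above together, the relevant part of the config might look like the sketch below. This assumes a YAML config; the file names and the exact nesting of keys other than tts.weights_path and hifi.weights_path are illustrative, so check them against the config shipped with the repo:

```yaml
tts:
  weights_path: pretrained/fastspeech2.pth   # pretrained FastSpeech2 checkpoint
hifi:
  weights_path: pretrained/hifigan.pth       # pretrained HiFi-GAN checkpoint

# These two files must sit next to the FastSpeech2 weights:
#   pretrained/speakers.json - speaker names
#   pretrained/stats.json    - raw-data pitch and energy statistics
```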
If everything above is set up, check the notebook "examples.ipynb".
data
├── speaker_one
│ ├── record_1.TextGrid # generated by MFA
│ ├── record_1.wav
│ └── record_1.lab # a plain text file containing the utterance text
│
└── speaker_two
├── ...
└── ...
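Before preprocessing, it is worth verifying that every .wav in the layout above has a matching .lab and .TextGrid. A small helper sketch (the data root and one-folder-per-speaker layout are assumed to match the tree above):

```python
from pathlib import Path

def find_incomplete_records(root: str) -> list[str]:
    """Return names of missing .lab/.TextGrid companions for each .wav."""
    missing = []
    # Expect layout: <root>/<speaker>/<record>.{wav,lab,TextGrid}
    for wav in Path(root).glob("*/*.wav"):
        for ext in (".lab", ".TextGrid"):
            if not wav.with_suffix(ext).exists():
                missing.append(wav.stem + ext)
    return sorted(missing)
```

Run it on your data root and fix any reported records before moving on.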
Once the data is organized and the path to it is set in the config ('raw_path'), run prepare_data.py.
prepare_data.py will generate additional files, such as energy and pitch features, in the folder set by 'preprocessed_path'.
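For intuition, the energy feature is typically the frame-wise RMS of the waveform. The sketch below is a simplified illustration, not the actual prepare_data.py code, which works on STFT frames and extracts F0 with a dedicated pitch tracker:

```python
import math

def frame_energy(samples, frame_size=1024, hop=256):
    """Frame-wise RMS energy of a waveform (simplified illustration)."""
    energies = []
    for start in range(0, max(len(samples) - frame_size + 1, 1), hop):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies
```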
Finally, set the path to the lexicon dictionary: words and their transcriptions generated by russian_g2p. If you do not use russian_g2p, your dictionary will be different. An example can be found in the 'pretrained' folder.
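MFA-style lexicons are plain text with one entry per line: the word followed by its space-separated phones. If you need to load such a file yourself, a minimal parser looks like this (the phone symbols in the test entries are illustrative, not exact russian_g2p output):

```python
def parse_lexicon(text):
    """Parse an MFA-style lexicon: one 'word PHONE PHONE ...' entry per line."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            lexicon[parts[0]] = parts[1:]
    return lexicon
```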