We started from this FastSpeech2 implementation: https://github.com/ming024/FastSpeech2
However, we have made several changes, so the code is not identical.
For example:
The Russian dataset was borrowed from https://github.com/vlomme/Multi-Tacotron-Voice-Cloning. We did not use all the speakers; we filtered them by utterance length and recording quality, which left 65 speakers in the end. You can check the samples in 'examples'.
MFA was trained from scratch after preprocessing the text with russian_g2p. Using MFA may not be straightforward, so we refer to this manual: https://github.com/ivanvovk/DurIAN#6-how-to-align-your-own-data
We use russian_g2p, so you will need to install it first:

git clone https://github.com/nsu-ai/russian_g2p.git
cd russian_g2p
pip3 install -r requirements.txt
pip install .
Then install this repository's requirements: pip install -r requirements.txt
Download weights: https://drive.google.com/drive/folders/1dX7ELe9C9-ja_liYrgph3Uu5Z5EMljjh?usp=sharing
Move the HiFi-GAN and FastSpeech2 weights into 'pretrained' and check that the paths in the config match:

tts.weights_path - path to the pretrained FastSpeech2 model;
speakers_json - speaker names; put it in the same folder as the model weights (it is already there for the pretrained model);
stats_json - raw-data pitch and energy statistics; also goes in the same folder as the model weights;
hifi.weights_path - path to the pretrained HiFi-GAN.
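Putting the points above together, the relevant part of the config might look like the sketch below. This assumes a YAML config; the file names and the exact nesting of keys other than tts.weights_path and hifi.weights_path are illustrative, so check them against the config shipped with the repo:

```yaml
tts:
  weights_path: pretrained/fastspeech2.pth   # pretrained FastSpeech2 checkpoint
hifi:
  weights_path: pretrained/hifigan.pth       # pretrained HiFi-GAN checkpoint

# These two files must sit next to the FastSpeech2 weights:
#   pretrained/speakers.json - speaker names
#   pretrained/stats.json    - raw-data pitch and energy statistics
```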
If everything above is set up, check the notebook "examples.ipynb".
data
├── speaker_one
│ ├── record_1.TextGrid # generated by MFA
│ ├── record_1.wav
│ └── record_1.lab # a plain text file containing the utterance text
│
└── speaker_two
├── ...
└── ...
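Before preprocessing, it is worth verifying that every .wav in the layout above has a matching .lab and .TextGrid. A small helper sketch (the data root and one-folder-per-speaker layout are assumed to match the tree above):

```python
from pathlib import Path

def find_incomplete_records(root: str) -> list[str]:
    """Return names of missing .lab/.TextGrid companions for each .wav."""
    missing = []
    # Expect layout: <root>/<speaker>/<record>.{wav,lab,TextGrid}
    for wav in Path(root).glob("*/*.wav"):
        for ext in (".lab", ".TextGrid"):
            if not wav.with_suffix(ext).exists():
                missing.append(wav.stem + ext)
    return sorted(missing)
```

Run it on your data root and fix any reported records before moving on.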
Once the data is organized and the path to it is set in the config ('raw_path'), run prepare_data.py.
prepare_data.py will generate additional files, such as energy and pitch features, in the folder set by 'preprocessed_path'.
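For intuition, the energy feature is typically the frame-wise RMS of the waveform. The sketch below is a simplified illustration, not the actual prepare_data.py code, which works on STFT frames and extracts F0 with a dedicated pitch tracker:

```python
import math

def frame_energy(samples, frame_size=1024, hop=256):
    """Frame-wise RMS energy of a waveform (simplified illustration)."""
    energies = []
    for start in range(0, max(len(samples) - frame_size + 1, 1), hop):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies
```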
Finally, set the path to the lexicon dictionary: words and their transcriptions generated by russian_g2p. If you do not use russian_g2p, your dictionary will be different. An example can be found in the 'pretrained' folder.
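MFA-style lexicons are plain text with one entry per line: the word followed by its space-separated phones. If you need to load such a file yourself, a minimal parser looks like this (the phone symbols in the test entries are illustrative, not exact russian_g2p output):

```python
def parse_lexicon(text):
    """Parse an MFA-style lexicon: one 'word PHONE PHONE ...' entry per line."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            lexicon[parts[0]] = parts[1:]
    return lexicon
```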