Multispeaker GlowTTS
Requirements
- torch >= 1.5.1
- tensorboardX >= 2.0
- librosa >= 0.7.2
- matplotlib >= 3.1.3
- Optional for loss flow
Structure
Vanilla mode (Single speaker GlowTTS)
### Training
### Inference
Speaker embedding mode
### Training
### Inference
Prosody encoding mode (GST GlowTTS)
### Training
### Inference
Gradient reversal mode (Voice cloning GlowTTS - Failed)
### Training
### Inference
Used dataset
- The uploaded code is currently compatible with the following datasets.
- An "O" to the left of a dataset name marks a dataset that was actually used for the uploaded results.
Hyper parameters
Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameters.yaml' according to your environment.
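As a rough illustration of how a few of the settings described below interact, the following sketch resolves them from a hyper parameter dictionary that has already been loaded from the YAML file. The function name `resolve_settings`, the returned keys, and the mapping of a GPU index to a `cuda:<index>` string are assumptions made for this example, not the repository's code. Only the setting names and their `null` and '-1' behaviour come from the descriptions in this section.

```python
def resolve_settings(hp):
    """Resolve a few derived settings from a loaded hyper parameter dict."""
    # Inference_Batch_Size falls back to Train/Batch_Size when it is null (None).
    inference_batch_size = hp.get("Inference_Batch_Size")
    if inference_batch_size is None:
        inference_batch_size = hp["Train"]["Batch_Size"]

    # Device '-1' means CPU only. Treating any other value as a GPU index
    # and formatting it as 'cuda:<index>' is an assumption of this sketch.
    device = str(hp.get("Device", "-1"))
    device_name = "cpu" if device == "-1" else "cuda:{}".format(device)

    # Speaker_Embedding/Type accepts null (single speaker), 'LUT', or 'GE2E'.
    embedding_type = hp["Speaker_Embedding"]["Type"]
    if embedding_type not in (None, "LUT", "GE2E"):
        raise ValueError("Unsupported speaker embedding type: {!r}".format(embedding_type))

    return {
        "inference_batch_size": inference_batch_size,
        "device": device_name,
        "multi_speaker": embedding_type is not None,
    }


if __name__ == "__main__":
    example = {
        "Train": {"Batch_Size": 32},
        "Inference_Batch_Size": None,
        "Speaker_Embedding": {"Type": "LUT"},
        "Device": "-1",
    }
    print(resolve_settings(example))
```

A check like this can be run once right after loading the file, so that an invalid `Type` value is reported before training starts.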
- Sound
    - Sets the basic sound parameters.
    - Some parameters, such as pitch, are not used in the current code. They are reserved for future work.
- Use_Cython_Alignment
    - Selects which implementation of monotonic alignment search to use.
    - If `true`, the Cython implementation from the official code is used.
    - If `false`, the Python implementation is used.
    - The Cython implementation is recommended because it is faster.
- Encoder
    - Sets the encoder parameters.
- Decoder
    - Sets the Glow decoder parameters.
- WaveNet
    - Sets the vocoder parameters.
    - This implementation uses a pre-trained Parallel WaveGAN model.
    - If the checkpoint path is `null`, the model does not export wav files.
    - If the checkpoint path is not `null`, all parameters must match those of the pre-trained Parallel WaveGAN model.
- Speaker_Embedding
    - Sets how the speaker embedding is generated.
    - In `Type`, you can select `null`, `'LUT'`, or `'GE2E'`.
        - `null`: No speaker embedding. This is the single speaker version.
        - `LUT`: The model builds a lookup table for the speakers.
        - `GE2E`: The model uses d-vectors generated by a pre-trained GE2E model.
- Token path
    - Sets the token-to-index dict file.
    - The pattern generator creates this file.
- Train
    - Sets the training parameters.
- Inference_Batch_Size
    - Sets the batch size used during inference.
    - If `null`, it is the same as `Train/Batch_Size`.
- Inference_Path
    - Sets the inference path.
- Checkpoint_Path
    - Sets the checkpoint path.
- Log_Path
    - Sets the tensorboard log path.
- Use_Mixed_Precision
    - Sets whether mixed precision is used.
    - To use it, Nvidia apex must be installed in the environment.
    - With some preprocessing hyper parameter settings, a loss overflow problem occurs.
- Device
    - Sets which GPU device is used in a multi-GPU environment.
    - If only the CPU is used, set this to '-1'.
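The `Use_Cython_Alignment` setting above switches between two implementations of the same algorithm. For orientation, here is a minimal pure-Python sketch of monotonic alignment search as it is described for Glow-TTS: dynamic programming over a [text length][mel length] log-likelihood matrix, followed by backtracking. The function name and the list-of-lists input are choices made for this example. It is not the repository's implementation, and it is far slower than a Cython version.

```python
def monotonic_alignment_search(log_likelihood):
    """Return a 0/1 alignment matrix for a [text_length][mel_length] score matrix.

    Each mel frame is assigned to exactly one token. The path starts at the
    first token, ends at the last token, and never skips or goes back.
    """
    text_length = len(log_likelihood)
    mel_length = len(log_likelihood[0])
    if text_length > mel_length:
        raise ValueError("Every token needs at least one mel frame.")

    neg_inf = float("-inf")
    # q[i][j]: best cumulative score of a path that ends at token i, frame j.
    q = [[neg_inf] * mel_length for _ in range(text_length)]
    q[0][0] = log_likelihood[0][0]
    for j in range(1, mel_length):
        # Token i cannot be reached before frame i, so only i <= j is filled.
        for i in range(min(j + 1, text_length)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, advance) + log_likelihood[i][j]

    # Backtrack from the last token at the last frame to the first frame.
    path = [[0] * mel_length for _ in range(text_length)]
    i = text_length - 1
    for j in range(mel_length - 1, -1, -1):
        path[i][j] = 1
        # Move to the previous token when that predecessor scored higher.
        if i > 0 and j > 0 and q[i - 1][j - 1] > q[i][j - 1]:
            i -= 1
    return path


if __name__ == "__main__":
    scores = [
        [0.0, 0.0, -5.0, -5.0],
        [-5.0, -5.0, 0.0, -5.0],
        [-5.0, -5.0, -5.0, 0.0],
    ]
    for row in monotonic_alignment_search(scores):
        print(row)
```

The two nested loops run once per training sample and grow with text length times mel length, which is why a compiled implementation makes a noticeable difference in training speed.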
Generate pattern
Command
python Pattern_Generate.py [parameters]
Parameters
At least one dataset must be specified.
- -lj
    - Set the path of LJSpeech. LJSpeech's patterns are generated.
- -bc2013
    - Set the path of Blizzard Challenge 2013. Blizzard Challenge 2013's patterns are generated.
- -cmua
    - Set the path of CMU arctic. CMU arctic's patterns are generated.
- -vctk
    - Set the path of VCTK. VCTK's patterns are generated.
- -libri
    - Set the path of LibriTTS. LibriTTS's patterns are generated.
- -vc1
    - Set the path of VoxCeleb1. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
- -vc2
    - Set the path of VoxCeleb2. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
- -vc1t
    - Set the path of the VoxCeleb1 test set. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
- -text
    - Set whether to save the text information.
    - This option exists for other models. To use the patterns in Glow-TTS, this option must be set.
- -evalr
    - Set the ratio of patterns used for evaluation.
    - Default is `0.001`.
- -evalm
    - Set the minimum number of evaluation patterns for each speaker.
    - Default is `1`.
- -mw
    - Set the number of threads used to create the patterns.
    - Default is `10`.
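To make the interaction between `-evalr` and `-evalm` concrete, the sketch below shows one plausible reading: the ratio is applied per speaker, rounded down, and the minimum acts as a floor. The function name, the rounding rule, and the speaker names in the usage example are assumptions for illustration only. The actual split logic lives in `Pattern_Generate.py` and may differ.

```python
def count_eval_patterns(pattern_counts, eval_ratio=0.001, eval_min=1):
    """Return how many patterns per speaker would go to the evaluation set.

    pattern_counts maps a speaker name to that speaker's total pattern count.
    """
    eval_counts = {}
    for speaker, count in pattern_counts.items():
        # Ratio-based share, rounded down (an assumption of this sketch).
        by_ratio = int(count * eval_ratio)
        # The per-speaker minimum acts as a floor, capped at what is available.
        eval_counts[speaker] = min(count, max(eval_min, by_ratio))
    return eval_counts


if __name__ == "__main__":
    # Hypothetical pattern counts for one large and one small speaker.
    print(count_eval_patterns({"speaker_a": 13100, "speaker_b": 231}))
```

Under this reading, the minimum matters mostly for multi-speaker datasets, where a small ratio alone would leave many speakers without any evaluation pattern.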
Run
Command
python Train.py -s <int>
-s <int>
- The resume step parameter.
- Default is 0.
- When this parameter is 0, the model tries to find the latest checkpoint in the checkpoint path.
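The "find the latest checkpoint" behaviour for `-s 0` can be sketched as a directory scan that picks the file with the highest step number. The checkpoint file naming used here (`S_<step>.pt`) and the function name are hypothetical and chosen only for this example. Check the files that `Train.py` actually writes into your checkpoint path before relying on a pattern like this.

```python
import os
import re


def find_latest_checkpoint(checkpoint_path, pattern=r"^S_(\d+)\.pt$"):
    """Return (step, file_path) of the highest-step checkpoint, or None.

    The default file name pattern is a hypothetical convention for this sketch.
    """
    if not os.path.isdir(checkpoint_path):
        return None
    found = []
    for name in os.listdir(checkpoint_path):
        match = re.match(pattern, name)
        if match:
            step = int(match.group(1))
            found.append((step, os.path.join(checkpoint_path, name)))
    # Tuples compare by step first, so max() picks the latest checkpoint.
    return max(found) if found else None


if __name__ == "__main__":
    latest = find_latest_checkpoint("./checkpoints")
    if latest is None:
        print("No checkpoint found, training would start from step 0.")
    else:
        print("Would resume from step {} ({})".format(*latest))
```

Comparing the parsed integer step rather than the file name avoids the usual pitfall where `S_9000` sorts after `S_10000` as a string.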
Inference
- Please check the example files for inference.
Result
Please see the demo site.
Trained checkpoint
| Mode | Dataset | Trained steps | Link |
|------|---------|---------------|------|
| Vanilla | LJ | 100000 | Link(Broken) |
| SE & LUT | LJ + CMUA | 100000 | Link |
| SE & LUT | LJ + VCTK | 100000 | Link |
| PE | LJ + CMUA | 100000 | Link |
| PE | LJ + VCTK | 400000 | Link |
| GR & LUT | LJ + VCTK | 400000 | Link(Failed) |
Future works
- Training with GE2E speaker embedding
- Gradient reversal model structure improvement
- Training additional steps