Multispeaker GlowTTS

This code is a reimplementation of the official Glow TTS code. If you want to use the Glow TTS model, I recommend referring to the official code.
The following is the paper I referred to:

Requirements

torch >= 1.5.1
tensorboardX >= 2.0
librosa >= 0.7.2
matplotlib >= 3.1.3
Optional, for monitoring the loss flow
- tensorboard >= 2.2.2

Structure

Vanilla mode (Single speaker GlowTTS)

### Training

### Inference

Speaker embedding mode

### Training

### Inference

Prosody encoding mode (GST GlowTTS)

### Training

### Inference

Gradient reversal mode (Voice cloning GlowTTS - Failed)

### Training

### Inference

Used dataset

The currently uploaded code is compatible with the following datasets.
An O in the Single or Multi column marks a dataset that was actually used for the uploaded single-speaker or multi-speaker results.

| Single | Multi | Dataset | Dataset address |
|---|---|---|---|
| O | O | LJSpeech | https://keithito.com/LJ-Speech-Dataset/ |
| X | X | BC2013 | http://www.cstr.ed.ac.uk/projects/blizzard/ |
| X | O | CMU Arctic | http://www.festvox.org/cmu_arctic/index.html |
| X | O | VCTK | https://datashare.is.ed.ac.uk/handle/10283/2651 |
| X | X | LibriTTS | https://openslr.org/60/ |

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameters.yaml' according to your environment.

Sound
- Setting basic sound parameters.
- Some parameters, such as pitch, are not used in the current code. They are reserved for future work.
Use_Cython_Alignment
- Setting which implementation of monotonic alignment search to use.
- If true, the Cython implementation from the official code will be used.
- If false, the Python implementation will be used.
- I recommend the Cython implementation because it is faster.
  - However, to use the Cython implementation, you must compile it before running.
  - Please refer to the following: https://github.com/jaywalnut310/glow-tts#2-pre-requisites
Encoder
- Setting the encoder parameters
Decoder
- Setting the glow decoder parameters.
WaveNet
- Setting the parameters of the vocoder.
- This implementation uses a pre-trained Parallel WaveGAN model.
  - https://github.com/CODEJIN/PWGAN_Torch
- If the checkpoint path is null, the model does not export wav files.
- If the checkpoint path is not null, all parameters must match those of the pre-trained Parallel WaveGAN model.
Speaker_Embedding
- Setting the method used to generate speaker embeddings.
- In Type, you can select null, 'LUT', or 'GE2E'.
  - null: No speaker embedding. Single-speaker version.
  - LUT: The model will build a lookup table for the speakers.
  - GE2E: The model will use d-vectors generated by a pretrained GE2E model.
    - The pretrained GE2E model is from Speaker_Embedding_Torch.
Token path
- Setting the path of the token-to-index dict.
- The pattern generator creates this file.
Train
- Setting the parameters of training.
Inference_Batch_Size
- Setting the batch size used at inference.
- If null, it will be the same as Train/Batch_Size.
Inference_Path
- Setting the inference path
Checkpoint_Path
- Setting the checkpoint path
Log_Path
- Setting the tensorboard log path
Use_Mixed_Precision
- Setting whether to use mixed precision.
- To use it, Nvidia apex must be installed in the environment.
- With some preprocessing hyper parameter settings, a loss overflow problem occurs.
Device
- Setting which GPU device is used in a multi-GPU environment.
- If using only the CPU, please set '-1'.
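
As a rough illustration of what the monotonic alignment search selected by Use_Cython_Alignment computes, the sketch below implements the dynamic program described in the Glow-TTS paper in plain Python. It is not the code from this repository or from the official one. The function name `monotonic_alignment_search` and the nested-list input are assumptions made for readability, and `log_likelihood[i][j]` is taken to be the log-likelihood of mel frame j under text token i.

```python
import math


def monotonic_alignment_search(log_likelihood):
    """Return path, where path[j] is the text token index aligned to mel frame j.

    Illustrative sketch only: the path starts at token 0, ends at the last
    token, and moves forward by at most one token per frame.
    """
    n_tokens = len(log_likelihood)
    n_frames = len(log_likelihood[0])
    if n_tokens > n_frames:
        raise ValueError("each token needs at least one frame")

    # q[i][j]: best cumulative log-likelihood of a path ending at (token i, frame j).
    q = [[-math.inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_likelihood[0][0]
    for j in range(1, n_frames):
        # Token i is only reachable at frame j if i <= j.
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else -math.inf
            q[i][j] = max(stay, advance) + log_likelihood[i][j]

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return path


# Two tokens, four frames: the first two frames fit token 0, the last two fit token 1.
example = [
    [0.0, 0.0, -5.0, -5.0],
    [-5.0, -5.0, 0.0, 0.0],
]
print(monotonic_alignment_search(example))  # → [0, 0, 1, 1]
```

The nested loops visit every reachable token-frame pair, which is probably why a pure Python version is slow on long utterances and why the compiled Cython version is recommended.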

Generate pattern

Command

python Pattern_Generate.py [parameters]

Parameters

At least one dataset must be used.

-lj
- Set the path of LJSpeech. LJSpeech's patterns are generated.
-bc2013
- Set the path of Blizzard Challenge 2013. Blizzard Challenge 2013's patterns are generated.
-cmua
- Set the path of CMU arctic. CMU arctic's patterns are generated.
-vctk
- Set the path of VCTK. VCTK's patterns are generated.
-libri
- Set the path of LibriTTS. LibriTTS's patterns are generated.
-vc1
- Set the path of VoxCeleb1. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
-vc2
- Set the path of VoxCeleb2. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
-vc1t
- Set the path of the VoxCeleb1 test set. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
-text
- Set whether the text information is saved or not.
- This option exists for other models. To use the patterns in Glow TTS, this option must be set.
-evalr
- Set the evaluation pattern ratio.
- Default is 0.001.
-evalm
- Set the evaluation pattern minimum of each speaker.
- Default is 1.
-mw
- The number of threads used to create the patterns.
- Default is 10.
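
To make the flags above concrete, here is a minimal sketch of a parser that accepts the text-capable dataset flags and the documented defaults. It is not the actual parser in Pattern_Generate.py. The helper names `build_parser` and `parse_pattern_args`, the treatment of `-text` as a boolean switch, and the example paths are all assumptions.

```python
import argparse

# Dataset flags documented above; each is assumed to take the dataset's root path.
DATASET_FLAGS = ("lj", "bc2013", "cmua", "vctk", "libri")


def build_parser():
    parser = argparse.ArgumentParser(prog="Pattern_Generate.py")
    for name in DATASET_FLAGS:
        parser.add_argument("-" + name, default=None, help="root path of the dataset")
    # Assumption: -text is a switch rather than a flag that takes a value.
    parser.add_argument("-text", action="store_true", help="save text information")
    parser.add_argument("-evalr", type=float, default=0.001, help="evaluation pattern ratio")
    parser.add_argument("-evalm", type=int, default=1, help="minimum evaluation patterns per speaker")
    parser.add_argument("-mw", type=int, default=10, help="number of worker threads")
    return parser


def parse_pattern_args(argv):
    """Parse argv and enforce that at least one dataset path was given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    datasets = {
        name: getattr(args, name)
        for name in DATASET_FLAGS
        if getattr(args, name) is not None
    }
    if not datasets:
        parser.error("at least one dataset path must be given")
    return args, datasets


# The paths below are placeholders, not real locations.
args, datasets = parse_pattern_args(
    ["-lj", "/data/LJSpeech-1.1", "-vctk", "/data/VCTK", "-text"]
)
print(datasets, args.evalr, args.evalm, args.mw)
```

Under the same assumptions, the equivalent command line would be `python Pattern_Generate.py -lj /data/LJSpeech-1.1 -vctk /data/VCTK -text`.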

Run

Command

python Train.py -s <int>

-s <int>
- The resume step parameter.
- Default is 0.
- When this parameter is 0, the model tries to find the latest checkpoint in the checkpoint path.
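
The resume behavior described above can be sketched as follows. This is only an illustration of the idea, not the logic in Train.py. The helper name `find_resume_checkpoint` and the `S_<step>.pt` file naming scheme are hypothetical, so adapt the pattern to the files your checkpoint path actually contains.

```python
import os
import re

# Hypothetical naming scheme: one file per saved step, e.g. "S_100000.pt".
CHECKPOINT_PATTERN = re.compile(r"^S_(\d+)\.pt$")


def find_resume_checkpoint(checkpoint_dir, step=0):
    """Return (step, path) of the checkpoint to load, or None to start fresh.

    step == 0 means "use the latest checkpoint if one exists".
    """
    found = {}
    if os.path.isdir(checkpoint_dir):
        for name in os.listdir(checkpoint_dir):
            match = CHECKPOINT_PATTERN.match(name)
            if match:
                found[int(match.group(1))] = os.path.join(checkpoint_dir, name)

    if step == 0:
        if not found:
            return None  # nothing saved yet, so training starts from scratch
        step = max(found)
    elif step not in found:
        raise FileNotFoundError(
            "no checkpoint for step {} in {}".format(step, checkpoint_dir)
        )
    return step, found[step]
```

Comparing the parsed step numbers rather than the file names avoids the string-ordering trap where `S_9000.pt` would sort after `S_25000.pt`.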

Inference

Please check the example files for inference:
- Inference_Example.ipynb
- Inference.py

Result

Please see the demo site.

Trained checkpoint

| Mode | Dataset | Trained steps | Link |
|---|---|---|---|
| Vanilla | LJ | 100000 | Link(Broken) |
| SE & LUT | LJ + CMUA | 100000 | Link |
| SE & LUT | LJ + VCTK | 100000 | Link |
| PE | LJ + CMUA | 100000 | Link |
| PE | LJ + VCTK | 400000 | Link |
| GR & LUT | LJ + VCTK | 400000 | Link(Failed) |

Future works

Training with GE2E speaker embedding
Gradient reversal model structure improvement
Training additional steps
