NaturalSpeech 2

This code is a unofficial implementation of NaturalSpeech 2.
The algorithm is based on the following paper:

Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., ... & Bian, J. (2023). NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv preprint arXiv:2304.09116.

Modifications from Paper

The structure is derived from NaturalSpeech 2, but I made several modifications.
The audio codec has been changed to HifiCodec from AcademiCodec.
- This is done to reduce the time spent training a separate audio codec.
- The model uses 22.05Khz audio, but no audio resampling is applied.
- To maintain similarity with the paper, it may be better to apply Google's SoundStream instead of HifiCodec, but I couldn't apply SoundStream to this repository because official pyTorch source code or pretrained model was not provided.
  - Although this repository does not use, there is also a c++ or tflite version of Lyra, which may allow the application of SoundStream using it.
- Meta's Encodec 24K version was also tested, but it could not be trained.
About CE-RVQ
- I have observed that the quality of the model with CE-RVQ applied is decreased, so it has been removed from the current implementation.
- However, this does not necessarily mean that CE-RVQ is ineffective.
- It could be due to the change of codec in this repository or the presence of bugs in the implemented module.
- This aspect requires further validation and investigation.
Information on the segment length σ of the speech prompt during training was not found in the paper and was arbitrarily set.
- The σ = 3, 5, and 10 seconds used in the evaluation of paper are too long to apply to both the variance predictor and diffusion during training.
- To ensure stability in pattern usage, half the length of the shortest pattern used in each training is set as σ for each training.
The target duration is obtained through Alignment learning framework (ALF), rather than being brought in externally.
- Using external modules such as Montreal Force Alignment (MFA) may have benefits in terms of training speed or stability, but I prioritized simplifying the training process.
Padding is applied between tokens like 'A <P> B <P> C ....'
- I could not verify whether there was a difference in performance depending on its usage.

Supported dataset

To apply zero-shot reported in the paper, I believe that it is necessary to have as many speakers as possible in the training data, but I were unable to test Multilingual LibriSpeech due to current environment.
Tested
- LJ Dataset
- VCTK Dataset
  - This repository used the VCTK092 from Torchaudio(https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip)
Supported but not tested
- LibriTTS Dataset
- Mulitilingual LibriSpeech Dataset
  - Only English language dataset generation is tested.

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in Hyper_Parameters.yaml according to your environment.

Sound
- Setting basic sound parameters.
Tokens
- The number of token.
- After pattern generating, you can see which tokens are included in the dataset at Token_Path.
Audio_Codec
- Setting the audio codec.
- This repository is using HifiCodec, so only the size of the latents output from HifiCodec's encoder is set for reference in other modules.
Train
- Setting the parameters of training.
Inference_Batch_Size
- Setting the batch size when inference
Inference_Path
- Setting the inference path
Checkpoint_Path
- Setting the checkpoint path
Log_Path
- Setting the tensorboard log path
Use_Mixed_Precision
- Setting using mixed precision
Use_Multi_GPU
- Setting using multi gpu
- By the nvcc problem, Only linux supports this option.
- If this is True, device parameter is also multiple like 0,1,2,3.
- And you have to change the training command also: please check multi_gpu.sh.
Device
- Setting which GPU devices are used in multi-GPU enviornment.
- Or, if using only CPU, please set '-1'. (But, I don't recommend while training.)

Generate pattern

python Pattern_Generate.py [parameters]

Parameters

-lj
- The path of LJSpeech dataset
-vctk
- The path of VCTK dataset
-libri
- The path of LbiriTTS dataset
-hp
- The path of hyperparameter.

About phonemizer

To phoneme string generate, this repository uses phonimizer library.
Please refer here to install phonemizer and backend

In Windows, you need more setting to use phonemizer.

Please refer here

In conda enviornment, the following commands are useful.

conda env config vars set PHONEMIZER_ESPEAK_PATH='C:\Program Files\eSpeak NG'
conda env config vars set PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll'

Run

Command

Single GPU

python Train.py -hp <path> -s <int>

-hp <path>
- The hyper paramter file path
- This is required.
-s <int>
- The resume step parameter.
- Default is 0.
- If value is 0, model try to search the latest checkpoint.

Multi GPU

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322

I recommend to check the multi_gpu.sh.

Checkpoint

Dataset		SR		Link
VCTK		22050		Google drive

This checkpoint was trained in a single GPU environment (RTX4090 x 1) with the VCTK dataset.
It has limited quality compared to the official demo, and there are issues with the generation for unseen reference.
While checking the loss flow, I observed the possibility of loss decreasing as training progresses, but it doesn't guarantee an improvement in quality.
Unfortunately, I don't have personal resources for testing beyond the current state, so I'm releasing the checkpoint after discontinuation.

CODEJIN / NaturalSpeech2

readme