
GST Tacotron in TF2

This code is an implementation of the paper 'Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis'. The algorithm is based on the following papers:

Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R. J., Battenberg, E., Shor, J., ... & Saurous, R. A. (2018). Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017.
He, M., Deng, Y., & He, L. (2019). Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS. arXiv preprint arXiv:1906.00672.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Saurous, R. A. (2018, April). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4779-4783). IEEE.
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., ... & Le, Q. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
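
For orientation, the core mechanism of GST (querying a bank of learned style tokens with a reference-encoder embedding via multi-head attention) can be sketched in TF2 as follows. This is a minimal illustration, not this repository's exact code; all layer sizes are assumptions.

    import tensorflow as tf

    class StyleTokenLayer(tf.keras.layers.Layer):
        def __init__(self, num_tokens=10, token_dim=256, num_heads=4):
            super().__init__()
            # Bank of learnable style tokens, shared across all utterances.
            self.tokens = self.add_weight(
                name='style_tokens',
                shape=(num_tokens, token_dim),
                initializer='glorot_uniform'
                )
            self.attention = tf.keras.layers.MultiHeadAttention(
                num_heads=num_heads, key_dim=token_dim // num_heads
                )

        def call(self, reference_embedding):
            # reference_embedding: [batch, dim] from the reference encoder.
            query = tf.expand_dims(reference_embedding, axis=1)  # [batch, 1, dim]
            keys = tf.tile(tf.tanh(self.tokens)[tf.newaxis], [tf.shape(query)[0], 1, 1])
            # Attention over the token bank yields the style embedding.
            return tf.squeeze(self.attention(query, keys), axis=1)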


Requirements

Please see 'Requirements.txt'.

Structure

Currently, the model supports only the Griffin-Lim vocoder. Attaching other vocoders is planned as future work.
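
Since only Griffin-Lim is supported, mel outputs can be inverted to audio without a neural vocoder. A minimal sketch using librosa is below; the sample rate, FFT size, and hop length are assumptions, so match them to the values in 'Hyper_Parameter.json'.

    import librosa
    import soundfile as sf

    def mel_to_wav_griffin_lim(mel, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
        # 'mel' is a power mel spectrogram of shape [n_mels, frames].
        # Griffin-Lim iteratively estimates the phase that the mel representation lost.
        return librosa.feature.inverse.mel_to_audio(
            mel, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=n_iter)

    # Example: sf.write('sample.wav', mel_to_wav_griffin_lim(mel), 22050)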

Used dataset

The currently uploaded code is compatible with the following datasets. An 'O' mark to the left of a dataset name indicates that the dataset was actually used for the uploaded results.

[O] LJSpeech: https://keithito.com/LJ-Speech-Dataset/
[X] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[X] LibriSpeech: https://www.openslr.org/12/
[X] TIMIT: http://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3
[X] Blizzard Challenge 2013: http://www.cstr.ed.ac.uk/projects/blizzard/
[O] FastVox: http://www.festvox.org/cmu_arctic/index.html

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameter.json' according to your environment.
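
A quick way to confirm the paths resolve before training is to load the file and check each entry. The key names below are placeholders for illustration; substitute the actual keys used in 'Hyper_Parameter.json'.

    import json, os

    with open('Hyper_Parameter.json', 'r') as f:
        hp = json.load(f)

    # Hypothetical key names; replace with the real ones from the file.
    for key in ['Pattern_Path', 'Inference_Path', 'Checkpoint_Path']:
        path = hp.get(key)
        status = 'exists' if path and os.path.exists(path) else 'MISSING'
        print('{}: {} ({})'.format(key, path, status))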

Generate pattern

Command

python Pattern_Generate.py [parameters]

Parameters

At least one of the datasets must be specified.

An inference file path can also be set for verification during training.

Run

Command

python Model.py [parameters]

Parameters

Inference

  1. Run 'ipython' in the model's directory.
  2. Run the following commands:
    from Model import GST_Tacotron
    new_GST_Tacotron = GST_Tacotron(is_Training= False)
    new_GST_Tacotron.Restore()
  3. Set the speaker's Wav path list and text list like the following example:
sentence_List = [
    'The grass is always greener on the other side of the fence.',
    'Strike while the iron is hot.',
    'A creative artist works on his next composition because he was not satisfied with his previous one.',
    'You cannot make an omelet without breaking a few eggs.',
    ]
wav_List_for_GST = [
    './Wav_for_Inference/FV.AWB.arctic_a0001.wav',
    './Wav_for_Inference/FV.JMK.arctic_a0004.wav',
    './Wav_for_Inference/FV.SLT.arctic_a0007.wav',
    './Wav_for_Inference/LJ.LJ050-0278.wav',
    ]

※ The length of the wav path list must be 1 or equal to the length of the sentence list.

  4. Run the following command:
    new_GST_Tacotron.Inference(
        sentence_List = sentence_List,
        wav_List_for_GST = wav_List_for_GST,
        label = 'Result'
        )
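
Because the wav list may contain a single path, one reference style can be applied to every sentence. For example (the label string here is arbitrary):

    new_GST_Tacotron.Inference(
        sentence_List = sentence_List,
        wav_List_for_GST = ['./Wav_for_Inference/LJ.LJ050-0278.wav'],
        label = 'Result_Single_Style'
        )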

GST embedding inference

  1. Perform steps 1 and 2 of the 'Inference' section above.

  2. Set the Wav path list and tag list like the following example:

    wav_List = [
        './Wav_for_Inference/FV.AWB.arctic_a0001.wav',
        './Wav_for_Inference/FV.JMK.arctic_a0004.wav',
        './Wav_for_Inference/FV.SLT.arctic_a0007.wav',
        './Wav_for_Inference/LJ.LJ050-0278.wav',
        ]
    tag_List = [
        'AWB',
        'JMK',
        'SLT',
        'LJ',
        ]

    ※ The two lists must have the same length.

  3. Run the following command:

    mels, stops, spectrograms, alignments = new_GST_Tacotron.Inference_GST(wav_List, tag_List)

  4. The result is saved as a text file in the inference directory. You can generate the t-SNE analysis graph by using the R script.
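
If you prefer Python to R, an equivalent t-SNE plot can be produced with scikit-learn. This is a sketch under assumptions: the embedding file is taken to be tab-separated with the tag in the first column, and the file name below is hypothetical.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    tags, embeddings = [], []
    with open('./Inference/GST_Embedding.txt', 'r') as f:  # hypothetical path
        for line in f:
            fields = line.strip().split('\t')
            tags.append(fields[0])
            embeddings.append([float(x) for x in fields[1:]])

    # Perplexity must stay below the number of samples.
    tsne = TSNE(n_components=2, perplexity=min(30, len(tags) - 1))
    points = tsne.fit_transform(np.array(embeddings))
    for tag, (x, y) in zip(tags, points):
        plt.scatter(x, y)
        plt.annotate(tag, (x, y))
    plt.savefig('GST_Embedding_tSNE.png')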

Result

Mel for GST: FastVox AWB A0001 (Wav_IDX_0, Figure_IDX_0)
Mel for GST: FastVox BDL A0002 (Wav_IDX_1, Figure_IDX_1)
Mel for GST: FastVox CLB A0003 (Wav_IDX_2, Figure_IDX_2)
Mel for GST: FastVox JMK A0004 (Wav_IDX_3, Figure_IDX_3)
Mel for GST: FastVox KSP A0005 (Wav_IDX_4, Figure_IDX_4)
Mel for GST: FastVox RMS A0006 (Wav_IDX_5, Figure_IDX_5)
Mel for GST: FastVox SLT A0007 (Wav_IDX_6, Figure_IDX_6)
Mel for GST: LJSpeech LJ050-0278 (Wav_IDX_7, Figure_IDX_7)

GST embedding t-SNE

(Figure: GST_Embedding)

Trained checkpoint

The trained checkpoint can be downloaded here.
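
After downloading, place the files at the checkpoint path configured in 'Hyper_Parameter.json'; the model can then be loaded with the same calls used in the Inference section:

    from Model import GST_Tacotron
    new_GST_Tacotron = GST_Tacotron(is_Training= False)
    new_GST_Tacotron.Restore()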

Future works

  1. Vocoder attachment. (I am currently evaluating several vocoders.)

    Prenger, R., Valle, R., & Catanzaro, B. (2019, May). WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3617-3621). IEEE.
    Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
    Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., ... & Kavukcuoglu, K. (2018). Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435.
    Yamamoto, R., Song, E., & Kim, J. M. (2020, May). Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6199-6203). IEEE.
    Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., ... & Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems (pp. 14881-14892).
  2. Tacotron 1 module update

    • The original GST paper used Tacotron 1, not Tacotron 2.
    • I hope to add Tacotron 1 for performance comparison and more.