
Deep Learning for Emotional Text-to-speech

A summary on our attempts at using Deep Learning approaches for Emotional Text to Speech



Contents

- Datasets
- Relevant literature
- Approach: Tacotron Models
- Approach: DCTTS Models
- Reproducibility and Code
- Demonstration
- Cite
- Contact

Datasets

| Dataset | No. of Speakers | Emotions | No. of utterances | No. of unique prompts | Duration | Language |
| --- | --- | --- | --- | --- | --- | --- |
| RAVDESS | 24 (12 female, 12 male) | 8 (calm, neutral, happy, sad, angry, fearful, surprise, disgust) | 1440 | 2 | ~1 hour | English |
| EMOV-DB | 5 (3 male, 2 female) | 5 (neutral, amused, angry, sleepy, disgust) | 6914 (1568, 1315, 1293, 1720, 1018) | 1150 | ~7 hours | English, French (1 male speaker) |
| LJ Speech | 1 (female) | NA (can be considered neutral) | 13100 | 13100 | 23 hours 55 minutes 17 seconds | English |
| IEMOCAP | 10 (5 female, 5 male) | 9 (anger, happiness, excitement, sadness, frustration, fear, surprise, other, neutral) | 10039 | NA | 12.5 hours | English |

Notes on each dataset (comments, pros and cons):

- RAVDESS
  - Each speaker has 4 utterances for the neutral emotion and 8 utterances for every other emotion, giving 60 utterances per speaker.
  - Easily available, and the emotions contained are very easy to interpret.
  - Very limited utterances, poor vocabulary, and the same utterances are repeated in different voices.
- EMOV-DB
  - An attempt at a large-scale corpus for emotional speech, and the only large-scale emotional corpus that we found freely available.
  - The Amused emotion contains non-verbal cues like chuckling that do not show up in the transcript; similarly, Sleepiness has yawning sounds. These non-verbal cues make synthesis difficult.
  - The emotions covered are not very easy to interpret, and not all emotions are available for all speakers.
- LJ Speech
  - One of the largest corpora for speech generation, with a rich vocabulary of ~14k unique words; the sentences are taken from 7 non-fiction books.
  - Abbreviations in the text are expanded in the speech.
  - No emotional annotations are available.
- IEMOCAP
  - Multi-modal dyadic conversations, scripted and enacted by a group of 10 actors (5 male, 5 female), with a variety of utterances, a rich vocabulary, multi-modal input, and easy-to-interpret emotional annotations.
  - Access is very restricted, and upon being granted access, we got a corrupted archive file.

Relevant literature

There are many more relevant papers that build on the vanilla Tacotron model. However, for the scope of our project, we restricted ourselves to these three papers.

Approach: Tacotron Models

:x: Approach 1: Fine-tuning a Vanilla Tacotron model on RAVDESS pre-trained on LJ Speech

Our first approach was to take a vanilla Tacotron model pre-trained on LJ Speech and fine-tune it on just one emotion (say, anger) to see whether the generated speech captures the prosodic features of that emotion.
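
For reference, below is a rough sketch of what this kind of fine-tuning loop looks like in PyTorch. It is not the actual train.py: the model interface, the loss terms, and the data loader yielding (text, mel, linear) batches from the single-emotion RAVDESS subset are all assumptions; only the optimiser and learning rate (Adam, 2e-3) correspond to the configuration listed for Approach 1 in the reproducibility table.

```python
# Rough sketch of the fine-tuning loop (placeholder names, not the actual train.py).
import torch
import torch.nn.functional as F


def finetune(model, loader, optimizer, epochs=10, device="cpu"):
    """Fine-tune a Tacotron-style model on (text, mel, linear) batches."""
    model.to(device).train()
    for _ in range(epochs):
        for text, mel, linear in loader:
            text, mel, linear = text.to(device), mel.to(device), linear.to(device)
            optimizer.zero_grad()
            # Assumed interface: teacher-forced forward pass returning both
            # mel-spectrogram and linear-spectrogram predictions.
            mel_pred, linear_pred = model(text, mel)
            loss = F.l1_loss(mel_pred, mel) + F.l1_loss(linear_pred, linear)
            loss.backward()
            optimizer.step()
    return model


# Approach 1: start from the LJ Speech checkpoint and fine-tune all parameters
# with Adam at the original learning rate, e.g.
#   optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
#   finetune(model, angry_ravdess_loader, optimizer)
```

The snippets under the next few approaches only show the lines that change relative to this sketch.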

Motivation

Observations

Inference and next steps

:x: Approach 2: Using a smaller learning rate for fine-tuning

In this approach, we repeated Approach 1, but this time used a smaller learning rate of 2e-5 for fine-tuning, compared to the previous learning rate of 2e-3. We did not make any other changes to the code or hyperparameters.
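
In terms of the sketch under Approach 1, the only line that changes is the optimiser construction:

```python
# Approach 2: identical loop, but with the learning rate lowered from 2e-3 to 2e-5.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```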

Motivation

Observations

Inference and next steps

:x: Approach 3: Using a smaller learning rate and SGD for fine-tuning

In this approach, we repeated Approach 2, but this time also switched the optimiser from Adam to SGD when commencing the fine-tuning.
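
Relative to the earlier sketch, this amounts to swapping the optimiser class while keeping the lower learning rate:

```python
# Approach 3: keep lr=2e-5, but optimise with SGD instead of Adam.
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5)
```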

Motivation

Observations

Inference and next steps

:x: Approach 4: Freezing the encoder and postnet

In this approach, we repeated Approach 3, with the addition of sending only the decoder's parameters for optimisation.
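
Concretely, this means marking the encoder and post-net parameters as non-trainable and handing only the decoder's parameters to SGD. The attribute names below (model.encoder, model.postnet, model.decoder) are assumptions about how the Tacotron implementation is organised; see train_fr_enc_sgd.py for the actual code.

```python
# Approach 4: freeze the encoder and post-net, optimise only the decoder with SGD.
for module in (model.encoder, model.postnet):   # assumed attribute names
    for p in module.parameters():
        p.requires_grad = False                 # exclude from gradient updates

optimizer = torch.optim.SGD(model.decoder.parameters(), lr=2e-5)
```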

Motivation

Observations

Inference and next steps

:x: Approach 5: Freezing the encoder and postnet, and switching back to Adam

For the sake of completeness, we also repeated Approach 4 but this time with Adam as the optimiser.
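
In code, the only difference from the Approach 4 snippet is the optimiser class:

```python
# Approach 5: encoder and post-net stay frozen, but the decoder is optimised with Adam.
optimizer = torch.optim.Adam(model.decoder.parameters(), lr=2e-5)
```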

Motivation

Observations

Inference and next steps

:white_check_mark: Approach 6: Freezing just the post-net, using Adam with low initial learning rate, training on EMOV-DB

The experiments inspired by the preprint based on DC-TTS are described below (Approaches 7 and 8). We also applied the same strategy to our vanilla Tacotron setup. Additionally, we used only one female speaker's data for each emotion; details on how the data was picked are given in Approach 8.
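
A sketch of the corresponding parameter selection is below; as before, the module attribute names are assumptions, while the optimiser and learning rate (Adam, 2e-5) match the entry for Approach 6 in the reproducibility table (train_fr_postnet_adam.py).

```python
# Approach 6: freeze only the post-net; the encoder and decoder are both fine-tuned.
for p in model.postnet.parameters():            # assumed attribute name
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=2e-5)
```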

Motivation

Observations

Inference and next steps

Approach: DCTTS Models

:x: Approach 7: Fine-tuning the Text2Mel module of the DC-TTS model on EMOV-DB pre-trained on LJ Speech

We started off by obtaining a DC-TTS model pre-trained on LJ Speech from this PyTorch implementation of DC-TTS. In that repository, a pre-trained DC-TTS model had been fine-tuned on Mongolian speech data, so we began by exploring whether the same process helps transfer emotional cues to the generated speech.

Around the same time, we came across this work, which explores transfer learning methods for low-resource emotional TTS. In their approach, the SSRN module is kept frozen while fine-tuning, because SSRN only performs the mapping between mel filterbanks (MFBs) and the full spectrogram: it should not depend on speaker identity or speaking style, as it is just trained to map between the two audio representations. The entire Text2Mel module was then fine-tuned on a single emotion (Anger) of EMOV-DB.
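
A minimal sketch of this freezing scheme, under the assumption that the PyTorch implementation exposes Text2Mel and SSRN as separate modules, is given below; the checkpoint file names are hypothetical.

```python
# Sketch: fine-tune Text2Mel on a single emotion of EMOV-DB while keeping SSRN frozen.
import torch


def prepare_dctts_finetuning(text2mel, ssrn, lr):
    """Load LJ Speech weights into both networks, then freeze SSRN."""
    # Hypothetical checkpoint names; the real ones depend on the implementation used.
    text2mel.load_state_dict(torch.load("ljspeech_text2mel.pth", map_location="cpu"))
    ssrn.load_state_dict(torch.load("ljspeech_ssrn.pth", map_location="cpu"))

    # SSRN only maps mel filterbank features (MFBs) to the full spectrogram, so it
    # should not depend on speaker identity or speaking style -- keep it frozen.
    for p in ssrn.parameters():
        p.requires_grad = False
    ssrn.eval()

    # Only Text2Mel sees the emotional data during fine-tuning.
    text2mel.train()
    return torch.optim.Adam(text2mel.parameters(), lr=lr)
```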

Motivation

Observations

Inference and next steps

:white_check_mark: Approach 8: Fine-tuning only on one speaker with reduced top_db and monotonic attention

In this approach, we repeated the steps in Approach 7. In accordance with the pre-processing steps described in the preprint, we made two small changes: we reduced the top_db threshold used when trimming silence from the audio, and we used monotonic attention (a sketch of the trimming change is given after the speaker table below).

Additionally, we used the data of only one female speaker per emotion. The EMOV-DB speaker used for each emotion is listed below:

| Emotion | Speaker |
| --- | --- |
| Anger | jenie |
| Disgust | bea |
| Sleepiness | bea |
| Amused | bea |
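
To illustrate the trimming change mentioned above: librosa's trim treats everything more than top_db decibels below the peak as silence, so lowering top_db trims away more of the quiet leading and trailing audio. The file name and the reduced threshold value below are only examples, not the exact settings from the preprint.

```python
# Illustration of trimming silence with a reduced top_db (example values only).
import librosa

wav, sr = librosa.load("anger_0001.wav", sr=22050)          # hypothetical EMOV-DB clip

trimmed_default, _ = librosa.effects.trim(wav, top_db=60)   # librosa's default threshold
trimmed_reduced, _ = librosa.effects.trim(wav, top_db=20)   # example of a reduced threshold

print(len(wav), len(trimmed_default), len(trimmed_reduced))
```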

Motivation

Observations

Inference and next steps

Reproducibility and Code

| Approach | Dataset | Result Dumps | Optimiser | Learning Rate | Training Script | Slides |
| --- | --- | --- | --- | --- | --- | --- |
| Approach 1 | RAVDESS (angry) | approach_1 | Adam | 2e-3 | train.py | [slides] |
| Approach 2 | RAVDESS (angry) | approach_2 | Adam | 2e-5 | train.py | [slides] |
| Approach 3 | RAVDESS (angry) | approach_3 | SGD | 2e-5 | train_sgd.py | [slides] |
| Approach 4 | RAVDESS (angry) | approach_4 | SGD | 2e-5 | train_fr_enc_sgd.py | [slides] |
| Approach 5 | RAVDESS (angry) | approach_5 | Adam | 2e-5 | train_fr_enc_adam.py | [slides] |
| Approach 6 | EMOV-DB (each emotion, one speaker) | approach_6 | Adam | 2e-5 | train_fr_postnet_adam.py | [slides] |
| Approach 7 | EMOV-DB (angry) | approach_7 | - | - | - | [slides] |
| Approach 8 | EMOV-DB (each emotion, one speaker) | approach_8 | - | - | - | [slides] |

Demonstration

In order to view a working demonstration of the models, open the file Demo_DL_Based_Emotional_TTS.ipynb and click on Open in Colab. Follow the steps as mentioned in the Colab Notebook.

Models used in our code are here: demo_models

Cite

If you find the models, code or approaches in this repository helpful, please consider citing this repository as follows:

@software{brihi_joshi_2020_3876081,
  author       = {Brihi Joshi and
                  Aditya Chetan and
                  Pulkit Madaan and
                  Pranav Jain and
                  Srija Anand and
                  Eshita and
                  Shruti Singh},
  title        = {{An exploration into Deep Learning methods for 
                   Emotional Text-to-Speech}},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v1.0.0},
  doi          = {10.5281/zenodo.3876081},
  url          = {https://doi.org/10.5281/zenodo.3876081}
}

Contact

For any errors or help in running the project, please open an issue or write to any of the project members -