Deep Learning for Emotional Text-to-speech
A summary of our attempts at using Deep Learning approaches for Emotional Text-to-Speech
Datasets
| Dataset | No. of Speakers | Emotions | No. of utterances | No. of unique prompts | Duration | Language | Comments | Pros | Cons |
|---|---|---|---|---|---|---|---|---|---|
| RAVDESS | 24 (12 female, 12 male) | 8 (calm, neutral, happy, sad, angry, fearful, surprise, and disgust) | 1440 | 2 | ~1 hour | English | Each speaker has 4 utterances for the neutral emotion and 8 utterances for each of the other emotions, leading to 60 utterances per speaker | Easily available<br>Emotions contained are very easy to interpret | Very limited utterances<br>Poor vocabulary<br>Same utterance in different voices |
| EMOV-DB | 5 (3 male, 2 female) | 5 (neutral, amused, angry, sleepy, disgust) | 6914 (1568, 1315, 1293, 1720, 1018) | 1150 | ~7 hours | English, French (1 male speaker) | An attempt at a large-scale corpus for emotional speech<br>The Amused emotion contains non-verbal cues like chuckling, which do not show up in the transcript<br>Similarly, Sleepiness has yawning sounds | Only large-scale emotional corpus that we found freely available | Emotions covered are not very easy to interpret<br>The non-verbal cues make synthesis difficult<br>Not all emotions are available for all speakers |
| LJ Speech | 1 (1 female) | NA (can be considered neutral) | 13100 | 13100 | 23 hours 55 minutes 17 seconds | English | One of the largest corpora for speech generation, with a rich vocabulary of about 14k unique words<br>The sentences are taken from 7 non-fiction books | Large-scale corpus<br>Rich vocabulary<br>Abbreviations in text are expanded in speech | No emotional annotations are available |
| IEMOCAP | 10 (5 female, 5 male) | 9 (anger, happiness, excitement, sadness, frustration, fear, surprise, other and neutral state) | 10039 | NA | 12.5 hours | English | Multi-modal dyadic conversations, scripted and enacted by 10 actors (5 male, 5 female) | Variety of utterances<br>Rich vocabulary<br>Multi-modal input<br>Easy to interpret emotional annotations | Access is very restricted, and upon being granted access, we got a corrupted archive file |
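As a quick sanity check, the utterance counts quoted in the table can be reproduced from the per-speaker and per-emotion figures. The snippet below only restates numbers from the table; the variable names are ours.

```python
# Utterance counts quoted in the Datasets table.
RAVDESS_SPEAKERS = 24
NEUTRAL_PER_SPEAKER = 4   # neutral utterances per speaker
OTHER_EMOTIONS = 7        # the remaining seven emotions
PER_OTHER_EMOTION = 8     # utterances per speaker for each of them

per_speaker = NEUTRAL_PER_SPEAKER + OTHER_EMOTIONS * PER_OTHER_EMOTION
ravdess_total = RAVDESS_SPEAKERS * per_speaker

# EMOV-DB per-emotion counts, in the order given in the table.
emov_counts = [1568, 1315, 1293, 1720, 1018]
emov_total = sum(emov_counts)

print(per_speaker, ravdess_total, emov_total)
```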
Relevant literature
- Tacotron: Towards End-to-End Speech Synthesis
- An extremely influential paper in the area of neural text-to-speech. The idea can be abstracted to a simple encoder-decoder network that is trained on pairs of ground-truth audio and textual transcripts.
- The reconstruction loss of the generated audio drives the training of the model.
- This was one of the architectures that we explored in this project. We also presented details about this paper in a class lecture. [slides]
- Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
- This work was done by the same team that developed Tacotron.
- The core idea was to improve the expressiveness of the generated speech by incorporating "Style Tokens": an additional embedding layer over the ground-truth audio, used to condition the generated audio so that "prosodic features" could be transferred.
- We also explored this model, and presented it for a class lecture. [slides]
- However, we did not explore this as extensively as the Tacotron, as it took a lot of time and resources to train.
- Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
- This work aimed at more efficient text-to-speech generation by using fully convolutional layers with guided attention.
- We came across this work while looking for efficient TTS systems that could be fine-tuned with a small amount of data.
There are many more relevant papers that build upon the Vanilla Tacotron model. However, for the scope of our project, we restricted ourselves to these three papers.
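To make the "guided attention" idea from the third paper more concrete, here is a minimal sketch of a guided attention penalty in plain Python. It assumes the commonly cited form `W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2))`, with `g = 0.2` as an illustrative default. The function names are ours and are not taken from any of the repositories used in this project.

```python
import math

def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix: close to 0 near the diagonal, close to 1 far from it."""
    return [
        [1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2 * g * g))
         for t in range(n_frames)]
        for n in range(n_text)
    ]

def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] over the whole attention matrix."""
    n_text, n_frames = len(attention), len(attention[0])
    w = guided_attention_weights(n_text, n_frames, g)
    total = sum(attention[n][t] * w[n][t]
                for n in range(n_text) for t in range(n_frames))
    return total / (n_text * n_frames)

# A perfectly diagonal alignment is not penalised; a reversed one is.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
anti_diagonal = [row[::-1] for row in diagonal]
print(guided_attention_loss(diagonal), guided_attention_loss(anti_diagonal))
```

Adding such a term to the training loss nudges the attention towards a roughly diagonal text-to-audio alignment.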
Approach: Tacotron Models
:x: Approach 1: Fine-tuning a Vanilla Tacotron model, pre-trained on LJ Speech, on RAVDESS
Our first approach was to take a vanilla Tacotron model pre-trained on LJ Speech, fine-tune it on just one emotion (say, anger) from RAVDESS, and see if the generated voice captured the prosodic features of that emotion.
Motivation
- We did not have access to any of the datasets described above except for RAVDESS and LJ Speech, and had also never tried any of the Tacotron-flavored models before.
- Hence, we initially just wanted to play around, at least generate results on LJ Speech, and analyse the quality of the generated speech.
- The fine-tuning idea seemed natural once pre-training was done. Since the RAVDESS dataset was extremely limited, there was no point in training on it from scratch, as the vocabulary the model would be exposed to would be extremely small.
- We were hoping that, at best, a small amount of fine-tuning would transfer prosodic features to the model and that, at worst, fine-tuning for a long interval would lead to over-fitting on the dataset.
Observations
- The alignment of the encoder and decoder states was completely destroyed within the first 1000 iterations of fine-tuning.
- At test time, the generated audio was initially empty. On further analysis, we discovered that this was because of the decoder's stopping criterion.
- If all the values in a generated frame were below a certain threshold, the decoder would stop producing new frames. We observed that in our case, this was happening right at the beginning.
- To fix this, we removed this condition and instead made the decoder produce frames for a minimum number of iterations.
- We observed that with few fine-tuning iterations (1-3k), the audio produced was complete noise, with no intelligible speech.
- If we fine-tuned for long durations (~40k iterations), the model was able to generate angry speech for the utterances that are in the training set. However, even for utterances outside the training set, it spoke only parts of the training set utterances, indicating that the model had overfitted on this dataset.
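The stopping issue described above can be illustrated with a small, framework-free sketch. It shows one possible reading of the fix, in which the stop check is simply ignored until a minimum number of frames has been produced. `step_fn`, `min_steps` and the threshold `eps=0.2` are hypothetical stand-ins, not the names or values used in the actual Tacotron code.

```python
def is_end_of_frames(frame, eps=0.2):
    """True when every value in the frame is at or below the threshold."""
    return all(value <= eps for value in frame)

def decode(step_fn, min_steps=0, max_steps=1000, eps=0.2):
    """Run a decoder step function until a near-silent frame appears.

    step_fn(t) is a hypothetical stand-in for one decoder step and returns
    one frame (a list of floats). The stop check is ignored for the first
    min_steps frames, so an early near-silent frame cannot end decoding.
    """
    frames = []
    for t in range(max_steps):
        frame = step_fn(t)
        frames.append(frame)
        if t + 1 >= min_steps and is_end_of_frames(frame, eps):
            break
    return frames

# Toy decoder: a silent first frame, then speech-like frames, then silence.
def toy_step(t):
    return [0.0, 0.1] if t == 0 or t >= 5 else [0.9, 0.4]

print(len(decode(toy_step)))               # stops on the very first frame
print(len(decode(toy_step, min_steps=3)))  # decodes past the early silent frame
```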
Inference and next steps
- The observations presented above seemed to be a case of "catastrophic forgetting", where the model was forgetting the information that it had already learnt in the pre-training stage.
- To counter this, we were advised to tweak the hyperparameters and training strategy of the model, such as the learning rate, the optimiser used, etc.
- We decided to try out the following approaches:
- Start the fine-tuning steps with a lower learning rate: Pre-training was done at 0.002 (2e-3), so we decided to fine-tune with 2e-5. Note that the code also implemented an annealing learning rate strategy, where the learning rate was reduced after a few steps. We did not change it, as it had given good results during pre-training.
- Changing the optimizer from Adam to SGD: Because the number of samples used for fine-tuning was small, and SGD is often reported to generalise better for a smaller sample size, we decided to do this.
- Freezing the Encoder of the Tacotron while fine-tuning: We thought of this because the main purpose of the encoder is to convert the text to a latent space. Since LJ Speech had a better vocabulary either way, we did not feel the need to re-train this component of the model on RAVDESS' much smaller vocabulary.
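To see how a lower initial learning rate interacts with an annealing schedule, here is a minimal sketch. It assumes a warm-up-then-decay schedule with 4000 warm-up steps. This particular form and the `warmup_steps` value are assumptions for illustration, and the schedule in the training code may differ.

```python
def annealed_learning_rate(initial_lr, global_step, warmup_steps=4000):
    """Linear warm-up to initial_lr, then decay proportional to 1/sqrt(step)."""
    step = global_step + 1
    return initial_lr * warmup_steps ** 0.5 * min(
        step * warmup_steps ** -1.5, step ** -0.5
    )

# Compare the pre-training rate (2e-3) with the fine-tuning rate (2e-5).
for initial_lr in (2e-3, 2e-5):
    rates = [annealed_learning_rate(initial_lr, s) for s in (0, 3999, 40000)]
    print(initial_lr, rates)
```

Under this assumption, changing the initial learning rate only rescales the whole schedule by a constant factor (here 100), while the shape of the annealing stays the same.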
:x: Approach 2: Using a smaller learning rate for fine-tuning
In this approach, we repeated Approach 1, but started the fine-tuning with a smaller learning rate of 2e-5 instead of the previous 2e-3. We did not make any other changes to the code or hyperparameters.
Motivation
- Again, up to this point, we had not discovered any new, larger corpus for emotional speech. Hence, we were stuck with RAVDESS.
- Drawing from the previous experiment, we wanted to see if the pre-trained Tacotron model forgetting its weights could be mitigated by reducing the effect of the new gradients used for updates during fine-tuning.
- To verify this, we reduced the learning rate of the model from 2e-3 to 2e-5.
Observations
- We did not observe any pattern, or any improvements as such in the initial iterations of fine-tuning (after 1k, 3k and 5k iterations).
- Simply changing the initial learning rate did not seem to have any effect on the training process.
- The alignment would still get destroyed, and again no audio was generated at test time.
Inference and next steps
- We concluded that simply changing the learning rate would not help with the training.
- To help with generalization, we decided to replace the Adam optimiser, which was used by default, with SGD at the time of fine-tuning.
:x: Approach 3: Using a smaller learning rate and SGD for fine-tuning
In this approach, we repeated Approach 2, but this time, when starting the fine-tuning, we also switched the optimiser from Adam to SGD.
Motivation
Observations
- Again, we could not observe any clear progression in the alignment plots.
- The utterances that were seen in the training set could be produced at test time. However, the model was not able to generate any unseen utterances.
Inference and next steps
- Since the model was not able to generalise properly even in this case, we felt that this could be due to two reasons:
- The model had not even started learning anything in the fine-tuning stage
- Even all these measures to prevent "catastrophic forgetting" were not helping the model to retain information
- We decided to go with the second explanation for now. To further aid the network in retaining its gained knowledge, we decided to backpropagate the gradients only through the decoder module of the Tacotron.
:x: Approach 4: Freezing the encoder and postnet
In this approach, we repeated Approach 3, but passed only the decoder's parameters to the optimiser.
Motivation
- Freezing the encoder's parameters would further help the model in retaining what it learnt during pre-training.
- We hypothesized that, since the encoder learns a much richer vocabulary while being trained on LJ Speech, there was no merit in re-training the encoder on the much smaller vocabulary of RAVDESS.
- Similarly, the postnet merely learns a mapping from the Mel-space to a Linear-space, which would not change for new data. Hence, it can be frozen too.
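A minimal PyTorch sketch of this setup is shown below. `TinyTacotron` is a hypothetical stand-in with three sub-modules named after the components discussed here. The real model's attribute names and layers may differ, so treat this as an illustration of the freezing pattern rather than the project's training code.

```python
import torch
from torch import nn

class TinyTacotron(nn.Module):
    """Hypothetical stand-in with the three components named in the text."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)
        self.postnet = nn.Linear(8, 8)

def freeze(module):
    """Exclude a sub-module's parameters from gradient updates."""
    for param in module.parameters():
        param.requires_grad = False

model = TinyTacotron()
freeze(model.encoder)
freeze(model.postnet)

# Only the parameters that still require gradients go to the optimiser.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=2e-5)

n_trainable = sum(p.numel() for p in trainable)
print(n_trainable)
```

Freezing only the postnet, or swapping `torch.optim.SGD` for `torch.optim.Adam`, follows the same pattern.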
Observations
- We got the same results as Approach 3, and the observations were inconclusive.
Inference and next steps
- Since the observations were inconclusive, we felt that probably the only way to resolve the problem was to get more emotionally annotated data.
- By this time, we also felt that it would make sense to only consider data spoken by a single speaker.
- We started looking even at non-published sources to see if there was any material on low-resource training of Neural TTS models.
:x: Approach 5: Freezing the encoder and postnet, and switching back to Adam
For the sake of completeness, we also repeated Approach 4 but this time with Adam as the optimiser.
Motivation
- This was mostly done for the sake of completeness
- We had not tried the different changes we had made with Adam, so we felt that it might make sense to go back to it.
Observations
- The results had virtually no change from Approach 4.
Inference and next steps
- We had discovered the preprint, "Exploring Transfer Learning for Low Resource Emotional TTS", and for the next few attempts, we focussed on trying to replicate the method given there.
- We also decided to shift our experiments from RAVDESS to EMOV-DB, as the preprint was working with this dataset.
- EMOV-DB is also larger than RAVDESS and has a richer vocabulary, although its emotional labels are a bit difficult to interpret perceptually.
:white_check_mark: Approach 6: Freezing just the post-net, using Adam with low initial learning rate, training on EMOV-DB
The experiments based on DC-TTS that were inspired by the preprint are described below. We also thought of applying the DC-TTS strategy to our Vanilla Tacotron setup. Additionally, for each emotion, we used the data of only one female speaker. Details on how the data was picked are given in Approach 8.
Motivation
- Using a single speaker for each emotion separately:
- We discovered on online forums dedicated to TTS systems, such as the Mozilla discourse channel, that even SOTA Neural TTS systems perform poorly when fine-tuned on multi-speaker datasets. [post]
- Freezing the postnet only:
- In the case of DC-TTS, we were retraining the entire Text2Mel module, which is responsible for mapping the input text to a Mel-spectrogram.
- We do not tamper with the SSRN, which learns a mel to linear mapping.
- Analogously, in the Tacotron, we fine-tune the Encoder + Decoder, which are responsible for mapping the input text to the Mel-spectrogram.
- We leave the Post net, which learns a mapping from the Mel-spectrogram to the Linear-spectrogram, untouched.
- Using a lower learning rate:
- We just wanted to be cautious not to erase the previous weights of the pre-trained model completely
- Using Adam as optimiser:
- There was no specific reason to do this. Maybe SGD would have worked better here. We have not checked this!
Observations
- We tried three emotions through this approach: Disgust, Sleepiness and Amused.
- All of them showed greatly improved results! We could hear intelligible speech with emotions too!
- The alignment plots were also greatly improved and could be seen to improve with increase in fine-tuning steps.
Inference and next steps
- The idea of freezing only the post-net seemed to work wonders.
- It would be interesting to investigate the effect of changing optimisers and learning rates in this setup. It would help ascertain how much of a role the changed data played in the improved performance.
Approach: DCTTS Models
:x: Approach 7: Fine-tuning the Text2Mel module of the DC-TTS model on EMOV-DB pre-trained on LJ Speech
We started off by obtaining a DC-TTS model pre-trained on LJ Speech from this PyTorch implementation of DC-TTS. In this repository, a pre-trained DC-TTS model was fine-tuned on Mongolian speech data, and we began by exploring whether the same process helps transfer emotional cues to the generated speech.
We also simultaneously came across this work that explores transfer learning methods for low-resource emotional TTS. In their approach, the SSRN module is kept frozen while fine-tuning, because SSRN maps mel filter banks (MFBs) to the full spectrogram. It should therefore not depend on the speaker identity or speaking style, as it is only trained to map between two audio representations. The entire Text2Mel module was fine-tuned on a single emotion (Anger) of EMOV-DB.
Motivation
- We wanted to see if the default settings used for transfer learning on the Mongolian dataset would work for emotional data.
- Because so far, Tacotron was not leading to any results, trying out something that is claimed to work felt like a natural next step.
Observations
- The DC-TTS model pre-trained on LJ Speech worked fine with the `synthesize.py` script in the given repository.
- For the fine-tuned model, even though the generated audio file was not of zero length, it did not contain any sound. The mel-spectrograms (mels) and the magnitude spectrograms (mags) were also completely empty.
Inference and next steps
- We initially suspected that not fine-tuning the SSRN module was leading to the blank audio. However, on delving deeper, we found that it was Text2Mel that was not generating the required output, as the mel-spectrograms it generated were blank.
- We then explored the finer details of this work.
:white_check_mark: Approach 8: Fine-tuning only on one speaker with reduced `top_db` and monotonic attention
In this approach, we repeated the steps in Approach 7. In accordance with the pre-processing steps described in the preprint, we made two small changes: we reduced `top_db` and used monotonic attention.
Additionally, we used the data of only one female speaker per emotion. The speaker used for each emotion from EMOV-DB is listed below:
| Emotion | Speaker |
|---|---|
| Anger | jenie |
| Disgust | bea |
| Sleepiness | bea |
| Amused | bea |
- The data files for these speakers can be downloaded from the EMOV-DB repository (link in the Datasets table above). We downloaded the sorted version of the files from the link given in the repository.
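For readers unfamiliar with `top_db`, the sketch below illustrates what such a threshold does, assuming it controls silence trimming in the same sense as the `top_db` parameter of `librosa.effects.trim`: frames quieter than `top_db` decibels below the loudest frame count as silence. This is a simplified, framework-free illustration. The frame length and the values 60 and 20 are illustrative assumptions, not the settings used in our pipeline.

```python
import math

def trim_silence(samples, top_db, frame_length=4):
    """Drop leading/trailing frames quieter than top_db below the loudest frame."""
    frames = [samples[i:i + frame_length]
              for i in range(0, len(samples), frame_length)]
    rms = [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]
    peak = max(rms)
    if peak == 0.0:
        return samples  # all-silent input: nothing sensible to trim against
    # Level of each frame in dB relative to the loudest frame (which is 0 dB).
    levels = [20 * math.log10(max(r, 1e-10) / peak) for r in rms]
    keep = [i for i, level in enumerate(levels) if level > -top_db]
    start, end = keep[0], keep[-1] + 1
    return samples[start * frame_length:end * frame_length]

# Toy signal: faint noise (about -40 dB), loud speech, faint noise.
signal = [0.01] * 8 + [1.0] * 8 + [0.01] * 8
print(len(trim_silence(signal, top_db=60)))  # faint noise is kept
print(len(trim_silence(signal, top_db=20)))  # faint noise is trimmed
```

A lower `top_db` therefore trims more aggressively, which removes more of the low-level noise and pauses at the start and end of each clip.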
Motivation
- The change in `top_db` was solely motivated by replicating the preprint's pipeline.
- The monotonic attention was also suggested by the preprint. It made sense too, as monotonic attention helps impose some semblance of temporal structure on the model.
- Lastly, using one female speaker was also motivated by the preprint. It does not mention which female speaker was used, so we used the speaker who had the higher number of utterances for each emotion. Using a female speaker made sense, as LJ Speech also has a female voice, and we hoped it would reduce the amount of learning that the model had to do.
- The idea of using a single speaker for a single emotion was also discussed on online forums on TTS systems, like the Mozilla discourse channel. In this [post], users discuss that TTS systems, even SOTA ones, do not perform well on multi-speaker low-resource datasets, which we felt also justified taking this approach.
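One way to impose monotonicity at synthesis time is to constrain how far the attended text position may move between consecutive decoder steps, in the spirit of the forcibly incremental attention described for DC-TTS. The sketch below is an assumption about how this can be done, not a copy of the preprint's or the repository's code. The allowed window of -1 to +3 positions and the function names are illustrative.

```python
def enforce_monotonic(attention, max_jump=3):
    """Return the attended text position per decoder step, kept near-monotonic.

    attention[t][n] is the weight on text position n at decoder step t.
    If the most-attended position moves back by more than one or forward by
    more than max_jump, it is forced to the previous position plus one.
    """
    n_text = len(attention[0])
    positions, previous = [], 0
    for t, row in enumerate(attention):
        current = max(range(n_text), key=row.__getitem__)
        if t > 0 and not (-1 <= current - previous <= max_jump):
            current = min(previous + 1, n_text - 1)
        positions.append(current)
        previous = current
    return positions

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

# The third step jumps from position 1 to 5, so it is pulled back to 2.
attention = [one_hot(n, 6) for n in (0, 1, 5, 2, 3)]
positions = enforce_monotonic(attention)
print(positions)
```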
Observations
- For the first time, we saw good-quality, generalisable speech for an emotion! The Anger emotion was generated quite well!
- However, with the other emotions, the previous problems still persisted. The generated spectrograms were still blank for all other emotions :cry:
Inference and next steps
- Amused and Sleepiness were challenging emotions to learn in the first place. This was because of the presence of non-verbal cues like chuckling, yawning, etc, which are absent from the transcripts. The preprint said the same thing about these emotions.
- For Disgust, on plotting the mel-spectrograms of some ground-truth samples, we discovered that along the temporal axis, the perceptual distinction between successive frames was lower than for Anger. We believe that this was the reason the model was not able to generate Disgust properly. However, this is only speculation.
Reproducibility and Code
- For Tacotron, we worked on our modified fork of r9y9's repository. To reproduce our results, you can use our fork.
- For DC-TTS, we worked on our modified fork of tugstugi's repository. Again, to reproduce our results, you can use our fork.
- Below, for each approach, we have specified the location of the saved models, the training script to run for an approach, the dataset used, and a link to the slides we made for a detailed presentation of results.
- For Tacotron-based approaches, the learning rate can be changed by editing the `initial_learning_rate` parameter in `hparams.py`
- Note that for the DC-TTS approaches, we have not specified Learning Rate, Optimiser and Training script as they do not change.
Demonstration
In order to view a working demonstration of the models, open the file `Demo_DL_Based_Emotional_TTS.ipynb` and click on Open in Colab. Follow the steps as mentioned in the Colab Notebook.
Models used in our code are here: demo_models
Cite
If you find the models, code or approaches in this repository helpful, please consider citing this repository as follows:
@software{brihi_joshi_2020_3876081,
author = {Brihi Joshi and
Aditya Chetan and
Pulkit Madaan and
Pranav Jain and
Srija Anand and
Eshita and
Shruti Singh},
title = {{An exploration into Deep Learning methods for
Emotional Text-to-Speech}},
month = jun,
year = 2020,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.3876081},
url = {https://doi.org/10.5281/zenodo.3876081}
}
Contact
For any errors or help in running the project, please open an issue or write to any of the project members -
- Pulkit Madaan (pulkit16257 [at] iiitd [dot] ac [dot] in)
- Aditya Chetan (aditya16217 [at] iiitd [dot] ac [dot] in)
- Brihi Joshi (brihi16142 [at] iiitd [dot] ac [dot] in)
- Pranav Jain (pranav16255 [at] iiitd [dot] ac [dot] in)
- Srija Anand (srija17199 [at] iiitd [dot] ac [dot] in)
- Eshita (eshita17149 [at] iiitd [dot] ac [dot] in)
- Shruti Singh (shruti17211 [at] iiitd [dot] ac [dot] in)