This module contains code in support of the paper Emotion Recognition in Speech using Cross-Modal Transfer in the Wild. The experiments are implemented using the MatConvNet framework.
PyTorch versions of the vision models are available here.
The easiest way to use this module is to install it with the vl_contrib package manager. mcnCrossModalEmotions can be installed with the following commands from the root directory of your MatConvNet installation:
vl_contrib('install', 'mcnCrossModalEmotions') ;
vl_contrib('setup', 'mcnCrossModalEmotions') ;
mcnCrossModalEmotions requires a working MatConvNet installation, together with several additional modules (which will be downloaded automatically).
Some of the analysis scripts additionally require vlfeat (it is not needed for training).
The project page has a high level overview of the work.
Usage: to evaluate the teacher models, run exps/benchmark_ferplus_models.m, which will download the pretrained teacher models and data and launch the evaluation. New teacher models can be trained with the exps/ferplus_baselines.m script.
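A rough MATLAB sketch of how these entry points might be invoked is shown below; the contrib path is an assumption based on where vl_contrib places modules, so adjust it to your installation.

```matlab
% Sketch: evaluate the pretrained teachers, then (optionally) retrain them.
% Assumes MatConvNet is set up and the module was installed via vl_contrib.
vl_contrib('setup', 'mcnCrossModalEmotions') ;
addpath(fullfile(vl_rootnn, 'contrib', 'mcnCrossModalEmotions', 'exps')) ;  % assumed install location
benchmark_ferplus_models ;   % downloads pretrained teachers + data, runs the evaluation
ferplus_baselines ;          % trains new teacher models
```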
The following pretrained teacher CNNs are available:
| model | pretraining | training | FER2013+ Val (accuracy %) | FER2013+ Test (accuracy %) |
| --- | --- | --- | --- | --- |
| resnet50-ferplus | VGGFace2 | FER2013+ | 89.0 | 87.6 |
| senet50-ferplus | VGGFace2 | FER2013+ | 89.8 | 88.8 |
More details relating to the models can be found here.
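For orientation, a minimal sketch of running one of these teachers on a single face crop with MatConvNet's DagNN wrapper follows; the .mat filename, the 'data' input and 'prediction' output variable names, and the normalization fields are assumptions and may differ for the released models.

```matlab
% Minimal sketch: classify one aligned face crop with a pretrained teacher.
net = dagnn.DagNN.loadobj(load('resnet50-ferplus-dag.mat')) ;   % assumed filename
net.mode = 'test' ;

im = single(imread('face.jpg')) ;
im = imresize(im, net.meta.normalization.imageSize(1:2)) ;
avgImg = net.meta.normalization.averageImage ;                  % mean image or per-channel mean
if numel(avgImg) == 3, avgImg = reshape(avgImg, 1, 1, 3) ; end
im = bsxfun(@minus, im, avgImg) ;

net.eval({'data', im}) ;                                        % assumed input variable name
scores = squeeze(net.getVar('prediction').value) ;              % assumed output variable name
[maxScore, best] = max(scores) ;
fprintf('predicted emotion class %d (score %.2f)\n', best, maxScore) ;
```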
In the paper, we show that there is sufficient signal to learn emotion-predictive embeddings in the student, but that it is a very noisy task. We validate that the student has learned something useful by testing it on external datasets for speech emotion recognition and showing that it can do quite a lot better than random, but as one would expect, not as well as a model trained with speech labels (Table 5 in the paper).
The distillation experiment can be re-run using the run_distillation.m script.
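As a reference point for what distillation involves here, the sketch below shows a generic softened cross-entropy between teacher and student posteriors; the temperature and the exact objective used in run_distillation.m are not restated in this README, so treat this purely as an illustration.

```matlab
% Illustrative distillation loss: cross-entropy between softened teacher and
% student posteriors. teacherLogits / studentLogits are C x N matrices
% (C emotion classes, N training frames); T is an assumed temperature.
% Uses implicit expansion (MATLAB R2016b+).
softmax = @(x) exp(x - max(x, [], 1)) ./ sum(exp(x - max(x, [], 1)), 1) ;
T = 2 ;                                      % assumed softening temperature
p = softmax(teacherLogits / T) ;             % teacher posteriors (targets)
q = softmax(studentLogits / T) ;             % student posteriors
loss = -mean(sum(p .* log(q + eps), 1)) ;    % average cross-entropy over the batch
```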
By treating the dominant prediction of the teacher as a kind of ground-truth one-hot label, we can also assess whether the student is able to match some portion of the teacher's predictive signal. It is worth noting that since we are using interview data to perform the distillation (because this is what VoxCeleb consists of), emotions such as neutral and happiness are better represented in the videos. For each emotion below, we first show the ROC curve for the student on the training set, followed by predictions on the test set of "heard" identities in the second column, then "unheard" identities in the third column.
The idea here is that by looking at the performance on previously heard/unheard identities we can get some idea as to whether it is trying to solve the task by "cheating". In this case, cheating would correspond to exploiting some bias in the dataset by memorising the identity of the speaker, rather than listening to the emotional content of their speech.
We find that the student is able to learn a weak classifier for the dominant emotion predicted by the teacher, hinting that there may be a small redundant signal between the (human annotated) facial emotion of a speaker and their (human annotated) speech emotion. These figures can be reproduced using the student_stats.m script.
[ROC curves for Anger, Happiness, Neutral, Surprise and Sadness: training set (first column), "heard" test identities (second column), "unheard" test identities (third column).]
Note that since the dataset is highly unbalanced, some of the emotions have very few samples; the remaining emotions (fear, disgust and contempt) are rarely predicted as dominant by the teacher on the validation sets.
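To give a concrete sense of how such per-emotion curves can be produced, here is a sketch using vl_roc from vlfeat (mentioned above); this is not the contents of student_stats.m, and the variable names are illustrative.

```matlab
% Sketch: ROC curve for the student against the teacher's dominant prediction.
% teacherProbs, studentProbs: C x N posteriors over the C emotion classes.
[~, dominant] = max(teacherProbs, [], 1) ;    % teacher's "one-hot" label per sample
emotion = 1 ;                                 % index of the emotion of interest
labels = 2 * (dominant == emotion) - 1 ;      % +1 / -1 labels expected by vl_roc
scores = studentProbs(emotion, :) ;           % student confidence for that emotion
[tpr, fpr, info] = vl_roc(labels, scores) ;
plot(fpr, tpr) ; xlabel('false positive rate') ; ylabel('true positive rate') ;
title(sprintf('AUC: %.3f', info.auc)) ;
```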
If you find the models or code useful, please consider citing:
@InProceedings{Albanie18a,
title = "Emotion Recognition in Speech using Cross-Modal Transfer in the Wild",
author = "Albanie, S. and Nagrani, A. and Vedaldi, A. and Zisserman, A.",
booktitle = "ACM Multimedia",
year = "2018",
}
References for the related datasets are given below. The FER2013+ dataset:
@inproceedings{BarsoumICMI2016,
title={Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution},
author={Barsoum, Emad and Zhang, Cha and Canton Ferrer, Cristian and Zhang, Zhengyou},
booktitle={ACM International Conference on Multimodal Interaction (ICMI)},
year={2016}
}
and the VoxCeleb dataset:
@article{nagrani2017voxceleb,
title={Voxceleb: a large-scale speaker identification dataset},
author={Nagrani, Arsha and Chung, Joon Son and Zisserman, Andrew},
journal={arXiv preprint arXiv:1706.08612},
year={2017}
}