Authors' official PyTorch implementation of EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition. If you use this code for your research, please cite our paper.
EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition
Niki Maria Foteinopoulou and Ioannis Patras
Abstract: Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions, or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations for zero-shot classification. To test this, we evaluate the model trained on sample-level descriptions using zero-shot classification on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements when compared to baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted Average Recall and 5% in terms of Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable to or better than state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson's Correlation Coefficient of up to 0.85, which is comparable to human experts' agreement.
In a nutshell, we follow the CLIP contrastive training paradigm to jointly optimise a video and a text encoder. The two encoders are trained jointly using a contrastive loss over the cosine similarities of the video-text pairings in the mini-batch. More specifically, the video encoder ($E_V$) is composed of the CLIP image encoder ($E_I$) and a Transformer Encoder that learns the temporal relationships of the frame spatial representations. The text encoder ($E_T$) used in our approach is the CLIP text encoder. The weights of the image and text encoders in our model are initialised from the large pre-trained CLIP weights, as FER datasets are not large enough to train a VLM from scratch with adequate generalisation. Contrary to previous video VLM works in both action recognition and FER, we propose using sample-level descriptions for better representation learning, rather than embeddings of class prototypes. This leads to more semantically rich representations, which in turn allow for better generalisation.
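For orientation, below is a minimal PyTorch sketch of this architecture and training objective. It is an illustration, not the repository code: the class name `VideoEncoder`, the constructor arguments, the mean-pooling over frames, and the use of the OpenAI `clip` package are assumptions made for the example.

```python
# Minimal sketch (not the repository code): CLIP image encoder + temporal
# Transformer as the video encoder, trained with a CLIP-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEncoder(nn.Module):
    """E_V: per-frame CLIP image features aggregated by a Transformer Encoder."""

    def __init__(self, clip_model, num_layers=2, num_heads=8):
        super().__init__()
        self.image_encoder = clip_model.visual            # E_I: CLIP image encoder
        dim = clip_model.visual.output_dim
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frames):                            # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.image_encoder(frames.flatten(0, 1))  # (B*T, D) frame spatial features
        feats = feats.reshape(B, T, -1)
        feats = self.temporal(feats)                      # model temporal relationships
        return feats.mean(dim=1)                          # (B, D) video representation


def contrastive_loss(video_emb, text_emb, logit_scale):
    """Symmetric InfoNCE over cosine similarities of video-text pairs in the batch."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * video_emb @ text_emb.t()       # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Illustrative setup (assumes the OpenAI CLIP package):
#   import clip
#   model, _ = clip.load("ViT-B/32")
#   E_V, E_T = VideoEncoder(model.float()), model.encode_text
```

At inference time, zero-shot classification then follows the usual CLIP recipe: class descriptions are embedded with $E_T$ and each video is assigned to the class whose text embedding has the highest cosine similarity with its video embedding.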
We recommend installing the required packages using Python's native virtual environment (venv) as follows:
$ python -m venv venv
$ source venv/bin/activate
(venv) $ pip install --upgrade pip
(venv) $ pip install -r requirements.txt
For using the aforementioned virtual environment in a Jupyter Notebook, you need to manually add the kernel as follows:
(venv) $ python -m ipykernel install --user --name=venv
The weights used for the downstream task (without the FC layer) can be found here.
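A hedged sketch of how such a checkpoint might be loaded as a feature extractor is shown below; the file name `emoclip_backbone.pt` and the state-dict layout are assumptions, and `VideoEncoder` refers to the illustrative class sketched above, not necessarily the repository's module.

```python
# Illustrative loading of the released backbone weights (FC layer not included).
import clip
import torch

model, _ = clip.load("ViT-B/32", device="cpu")
video_encoder = VideoEncoder(model.float())                # hypothetical class from the sketch above
state_dict = torch.load("emoclip_backbone.pt", map_location="cpu")  # assumed file name
video_encoder.load_state_dict(state_dict, strict=False)    # strict=False: checkpoint omits the FC layer
video_encoder.eval()                                        # use as a frozen feature extractor
```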
This work is supported by an EPSRC DTP studentship (No. EP/R513106/1) and the EU H2020 project AI4Media (No. 951911). This research utilised Queen Mary's Apocrita HPC facility, supported by QMUL Research-IT. http://doi.org/10.5281/zenodo.438045
@inproceedings{foteinopoulou_emoclip_2024,
title = {{EmoCLIP}: {A} {Vision}-{Language} {Method} for {Zero}-{Shot} {Video} {Facial} {Expression} {Recognition}},
author = {Foteinopoulou, Niki Maria and Patras, Ioannis},
year = {2024},
booktitle = {2024 {IEEE} 18th International Conference on Automatic Face and Gesture Recognition ({FG})}
}