This is the PyTorch implementation of TDGAN, a model for facial expression recognition (FER) presented in the following paper:
Xie, Siyue, Haifeng Hu, and Yizhen Chen. "Facial Expression Recognition with Two-branch Disentangled Generative Adversarial Network." IEEE Transactions on Circuits and Systems for Video Technology (2020).
Abstract: Facial Expression Recognition (FER) is a challenging task in computer vision as features extracted from expressional images are usually entangled with other facial attributes, e.g., poses or appearance variations, which are adverse to FER. To achieve a better FER performance, we propose a model named Two-branch Disentangled Generative Adversarial Network (TDGAN) for discriminative expression representation learning. Different from previous methods, TDGAN learns to disentangle expressional information from other unrelated facial attributes. To this end, we build the framework with two independent branches, which are specific for facial and expressional information processing respectively. Correspondingly, two discriminators are introduced to conduct identity and expression classification. By adversarial learning, TDGAN is able to transfer an expression to a given face. It simultaneously learns a discriminative representation that is disentangled from other facial attributes for each expression image, which is more effective for FER task. In addition, a self-supervised mechanism is proposed to improve representation learning, which enhances the power of disentangling. Quantitative and qualitative results in both in-the-lab and in-the-wild datasets demonstrate that TDGAN is competitive to the state-of-the-art methods.
Note: The face image in the above pipeline diagram is provided by my friend Dr. Shiyuan Li. He once said he had longed to see his face presented in a public publication, so I used his photo as an illustrative example in this paper. (Of course, I obtained his permission before placing the image.) But he still complained that the image I chose was far from satisfactory, as he doesn't look very handsome in it.
The paper is available here.
```bibtex
@article{xie2020facial,
  title={Facial expression recognition with two-branch disentangled generative adversarial network},
  author={Xie, Siyue and Hu, Haifeng and Chen, Yizhen},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  volume={31},
  number={6},
  pages={2359--2371},
  year={2020},
  publisher={IEEE}
}
```
Example face and expression images are provided under `./Dataset/examples`. To run a quick demo:

```
python main.py
```

This command will perform an evaluation step and generate an image under the directory `./Dataset/examples`, with the file name `generated_example.png`.

The framework of the model:
TDGAN learns image representations through expression transferring. The inputs are two images: a facial image with an identity label and an expressional image with an expression label. The goal of the generator is to transfer the expression from the expressional image to the face image. To this end, TDGAN uses two separate branches to extract the corresponding image representations, followed by a deconv-based decoder that generates the image. To make sure that the generated image fulfills our expectation, TDGAN employs two discriminators to evaluate it. One is a face discriminator, which conducts identity classification (so that TDGAN knows whether the facial appearance in the generated image matches that of the input facial image). The other is an expression discriminator, which conducts expression classification (so that TDGAN knows whether the expression in the generated image matches that of the input expressional image). Note that the expected identity information can only be extracted from the input facial image, while the expected expression information can only be extracted from the input expressional image. Therefore, under adversarial training, the face branch tends to extract only identity (facial appearance) related features, while the expression branch tends to extract only expression-related features. In other words, the expression branch is induced to disentangle expressional features from other features. We can therefore use such expression-specific features for the expression classification task.
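To make the two-branch idea concrete, here is a heavily simplified PyTorch sketch. The layer sizes, the 128-d embeddings, and all names are illustrative assumptions, not the actual architecture in `models.py`:

```python
import torch
import torch.nn as nn

class TwoBranchGenerator(nn.Module):
    """Simplified sketch: two encoders + a shared decoder (not the paper's exact layers)."""
    def __init__(self, emb_dim=128):
        super().__init__()
        def encoder():  # same conv template for both branches
            return nn.Sequential(
                nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, emb_dim),
            )
        self.face_branch = encoder()   # identity (facial appearance) features
        self.expr_branch = encoder()   # expression features
        self.decoder = nn.Sequential(  # deconv-based decoder
            nn.Linear(2 * emb_dim, 64 * 32 * 32), nn.ReLU(),
            nn.Unflatten(1, (64, 32, 32)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 64
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Tanh(),   # 64 -> 128
        )

    def forward(self, face_img, expr_img):
        f = self.face_branch(face_img)  # who the person is
        e = self.expr_branch(expr_img)  # what the expression is
        return self.decoder(torch.cat([f, e], dim=1)), f, e
```

The generated image is then scored by the two discriminators: the face discriminator against the identity label of the face input, and the expression discriminator against the expression label of the expression input.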
```
TDGAN
│   LoadData.py
│   main.py
│   models.py
│   README.md
│   trainer.py
│   util.py
│
└───Dataset
    ├───CASIA_WebFace
    │   │   casia_data_examples.npz
    │   └───img
    │       ├───0000045
    │       └───0000099
    ├───examples
    │   │   expression.jpg
    │   │   face.jpg
    └───RAF
        │   RAF_examples.npz
        └───img
```
`main.py`: the main function of the model. Optional arguments:

- training dataset: `CASIA` is supported; you can customize your own dataset by modifying the dataloader (default: `CASIA`)
- testing dataset: `RAF` is supported; you can customize your own dataset by modifying the dataloader (default: `RAF`)
- mode: `True` for training and `False` for evaluation (default: `False`)
- number of training epochs (default: `100`)
- batch size (default: `32`)
- learning rate (default: `1e-4`)
- device: `-1` for the CPU setting and a positive integer for the GPU setting (default: `-1`)

All hyper-parameters are stored in `hpar_dict` in `main.py`. If you want to specify each hyper-parameter, you can manually modify the values.

Other files:

- `trainer.py`: trainer for model training and evaluation
- `models.py`: implementations of all modules
- `LoadData.py`: customized dataset classes for PyTorch's dataloader (see the sketch below)
- `util.py`: some other helper functions
- `./Dataset/CASIA_WebFace/casia_data_examples.npz`: stores the file names of all required images in the `CASIA_WebFace` dataset
- `./Dataset/RAF/RAF_data_examples.npz`: stores the file names of all required images in the `RAF` dataset
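For reference, a minimal sketch of how a file-name-driven dataset class along these lines could be structured; the `.npz` key name, the grayscale loading, and the label convention are assumptions, not the repository's actual code:

```python
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class ExampleFaceDataset(Dataset):
    """Hypothetical sketch of a dataset driven by an .npz file of image names."""
    def __init__(self, npz_path, img_root, transform=None):
        archive = np.load(npz_path, allow_pickle=True)
        # Assumption: the archive stores relative file names under some key,
        # e.g. archive['file_names']; the real key may differ.
        self.file_names = archive['file_names']
        self.img_root = img_root
        self.transform = transform

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        name = str(self.file_names[idx])
        img = Image.open(f"{self.img_root}/{name}").convert('L')  # grayscale
        if self.transform is not None:
            img = self.transform(img)
        # Assumption: the identity label is encoded in the parent folder name,
        # as in ./Dataset/CASIA_WebFace/img/0000045/...
        label = int(name.split('/')[0])
        return img, label
```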
Experiments are conducted on three in-the-lab datasets (CK+, TFEID, RaFD) and two in-the-wild datasets (BAUM-2i, RAF-DB).
All the datasets are publicly accessible via the following links:
In the preprocessing stage, faces in the input images (both face and expression images) are first detected by MTCNN and then resized to 128x128x1. Data augmentation (random cropping and horizontal flipping) is also adopted in the training stage. To compare fairly with other methods, on CK+, TFEID, RaFD, and BAUM-2i we conduct subject-independent 10-fold cross-validation to evaluate our model; on RAF-DB, TDGAN is trained and evaluated on the predefined training and testing sets.
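For illustration, one plausible way to reproduce this preprocessing with off-the-shelf tools; the use of `facenet-pytorch`'s MTCNN and the over-size-then-crop margin are assumptions, not the authors' exact pipeline:

```python
from facenet_pytorch import MTCNN
from PIL import Image
from torchvision import transforms

# Face detection and cropping (facenet-pytorch's MTCNN, an assumed stand-in
# for the detector used by the authors); crops are resized to 128x128.
mtcnn = MTCNN(image_size=128, margin=0, post_process=False)

# Training-time augmentation: random cropping and horizontal flipping,
# plus conversion to a single-channel (grayscale) tensor.
train_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize(140),        # over-size slightly before the random crop (assumed margin)
    transforms.RandomCrop(128),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

img = Image.open('./Dataset/examples/face.jpg').convert('RGB')
face = mtcnn(img)                  # 3x128x128 float tensor, or None if no face is found
if face is not None:
    face_pil = transforms.ToPILImage()(face.byte())
    x = train_transform(face_pil)  # 1x128x128 tensor ready for the model
```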
| Dataset | Accuracy (%) |
|---|---|
| CK+ | 97.53 |
| TFEID | 97.20 |
| RaFD | 99.32 |
| BAUM-2i | 65.76 |
| RAF-DB | 81.91 |
The purpose of TDGAN is to recognize facial expressions. However, since we use a GAN framework for disentangling, TDGAN can also generate some interesting images as a by-product. By inspecting the generated images, we can also assess to some extent whether different attributes are disentangled from each other. Here we conduct two kinds of visualization experiments: expression transferring (interpolation) and face interpolation.
Note that since our main task is expression recognition, we do not optimize for the quality of the generated images. Also, the following images are cherry-picked, i.e., the model is not guaranteed to always successfully transfer a given expression to another face.
Given a face image, we can modify the expression in the face according to the given expression image. We can also fix the face image and interpolate between two expression images to see how the expression in the generated face gradually changes from one to the other. This shows that TDGAN can disentangle expressional features from other facial attributes, as it can modify the expression while keeping the facial appearance untouched.
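As a rough sketch of how such an interpolation can be produced (reusing the illustrative `TwoBranchGenerator` interface from the sketch above, not the repository's actual API): blend the two expression embeddings linearly while keeping the face embedding fixed, and decode each blend.

```python
import torch

@torch.no_grad()
def interpolate_expression(gen, face_img, expr_a, expr_b, steps=8):
    """Blend two expression embeddings while keeping the face embedding fixed."""
    f = gen.face_branch(face_img)           # fixed identity features
    ea = gen.expr_branch(expr_a)            # source expression features
    eb = gen.expr_branch(expr_b)            # target expression features
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        e = (1 - alpha) * ea + alpha * eb   # linear interpolation in feature space
        frames.append(gen.decoder(torch.cat([f, e], dim=1)))
    return frames                           # generated images from expression A to B
```

Face interpolation (shown further below) works the same way, except that the expression embedding is fixed and the two face embeddings are blended.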
Given a person's facial image, we change his expression from fear to anger:
An animation showing the transition process:
Given a person's facial image, we change his expression from neutral to happiness:
An animation showing the transition process:
Given two face images, we can interpolate between them to observe the transition from one face to the other. This also shows that TDGAN is able to disentangle facial appearance features from expression features, as the facial appearance changes while the expression is kept untouched.
Given an expression image (happiness) and two persons' face images, we change the facial appearance from one to the other while keeping the expression unchanged:
An animation showing the transition process:
Given an expression image (neutral) and two persons' face images, we change the facial appearance from one to the other while keeping the expression unchanged:
An animation showing the transition process: