Repository with the code of the paper: A proposal for Multimodal Emotion Recognition using aural transformers and Action Units on RAVDESS dataset
Install ffmpeg from:
https://www.ffmpeg.org/download.html#build-linux
To install the python packages, create a new virtual environment and run:
pip install git+https://github.com/huggingface/datasets.git
pip install git+https://github.com/huggingface/transformers.git
pip install -r requirements.txt
Note: if you run into problems installing certain libraries, try upgrading pip with pip3 install --upgrade pip and then run the previous command again.
To reproduce the experiments, first download the dataset used in them:
IMPORTANT NOTE: For training and testing the models, we use only the 1,440 speech videos, not the songs. The names of these files are listed in MMEmotionRecognition/data/ravdess_videos.csv.
Once downloaded, put the files in your working directory. In what follows, we will refer to these directories as:
To evaluate our models, we used subject-wise 5-fold cross-validation (5CV). The distribution of actors across the validation folds was as follows:
To extract the audio from the videos and convert it to 16 kHz, single-channel format, run:
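Before running any of the scripts, it can help to confirm that ffmpeg is visible from the environment you just created. The snippet below is a minimal sketch, not part of the repository; the function name check_environment is hypothetical and it only relies on the Python standard library.

```python
import shutil
import sys


def check_environment(tools=("ffmpeg",)):
    """Return a dict mapping each tool name to its resolved path, or None if missing."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for tool, path in check_environment().items():
        # shutil.which returns None when the executable is not on PATH
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If ffmpeg is reported as missing, install it from the link above before extracting the audio.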
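As an illustration of what "subject-wise" means here, the sketch below keeps all samples of one actor in the same fold, so no actor appears in both the training and the validation split. The round-robin assignment of actors to folds is an assumption made for the example and is not the distribution used in the paper; the function names are hypothetical.

```python
def assign_actors_to_folds(actor_ids, n_folds=5):
    """Map each actor to one fold (round-robin; NOT the paper's actual distribution)."""
    actors = sorted(set(actor_ids))
    return {actor: index % n_folds for index, actor in enumerate(actors)}


def split_by_fold(samples, fold_of_actor, val_fold):
    """Split (filename, actor_id) pairs into train/validation lists for one fold."""
    train = [s for s in samples if fold_of_actor[s[1]] != val_fold]
    val = [s for s in samples if fold_of_actor[s[1]] == val_fold]
    return train, val


if __name__ == "__main__":
    # Toy data: 24 actors with 2 hypothetical clips each
    samples = [(f"clip{c}_actor{a:02d}.mp4", a) for a in range(1, 25) for c in range(2)]
    fold_of_actor = assign_actors_to_folds(a for _, a in samples)
    for fold in range(5):
        train, val = split_by_fold(samples, fold_of_actor, fold)
        print(f"fold {fold}: {len(train)} train clips, {len(val)} validation clips")
```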
python3 MMEmotionRecognition/src/Audio/preProcessing/process_audio.py
--videos_dir <RAVDESS_dir>/videos
--out_dir <RAVDESS_dir>/audios_16kHz
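For readers who want to see what this conversion amounts to, here is a minimal sketch of an equivalent ffmpeg call driven from Python. It is an assumption about what such a step typically looks like, not the contents of process_audio.py; the function names are hypothetical. The ffmpeg options used are -vn (drop the video stream), -ac 1 (one audio channel) and -ar 16000 (16 kHz sample rate).

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, out_dir, sample_rate=16000):
    """Build the ffmpeg argument list that writes a mono WAV at the given sample rate."""
    video_path = Path(video_path)
    out_path = Path(out_dir) / (video_path.stem + ".wav")
    return [
        "ffmpeg", "-y",           # overwrite the output file if it already exists
        "-i", str(video_path),    # input video
        "-vn",                    # discard the video stream
        "-ac", "1",               # single audio channel
        "-ar", str(sample_rate),  # resample to 16 kHz by default
        str(out_path),
    ]


def extract_audio(video_path, out_dir):
    """Run ffmpeg on one video; raises CalledProcessError if ffmpeg fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, out_dir), check=True)
```

Calling extract_audio on every file in the videos directory would produce one WAV file per video in the output directory.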
To fine-tune the xlsr-Wav2Vec2.0 model, run:
python3 MMEmotionRecognition/src/Audio/FineTuningWav2Vec/main_FineTuneWav2Vec_CV.py
--audios_dir <RAVDESS_dir>/audios_16kHz --cache_dir MMEmotionRecognition/data/Audio/cache_dir
--out_dir <RAVDESS_dir>/FineTuningWav2Vec2_out
--model_id jonatasgrosman/wav2vec2-large-xlsr-53-english
After finishing the fine-tuning process, the datasets and the trained models will be saved in the folder
Important note: results may differ slightly from those reported in the paper because we added some extra lines to optimize the saving of the weights, which affects the randomization of the training.
To evaluate the trained model and obtain some metrics, run the Wav2Vec2.0 evaluation script as in the example below. Note that you should update the dict on line 119 (checkpoints_per_fold) to point to the top models if you changed the seed or used the code for other tasks.
python3 MMEmotionRecognition/src/Audio/FineTuningWav2Vec/Wav2VecEval.py
--data <RAVDESS_dir>/FineTuningWav2Vec2_out/data/YYYYMMDD_HHMMSS
--fold 0
--trained_model <RAVDESS_dir>/FineTuningWav2Vec2_out/trained_models/wav2vec2-xlsr-ravdess-speech-emotion-recognition/YYYYMMDD_HHMMSS
--out_dir <RAVDESS_dir>/FineTuningWav2Vec2_posteriors
--model_id jonatasgrosman/wav2vec2-large-xlsr-53-english
To evaluate our models and generate the posteriors, the code to execute would be the following:
python3 MMEmotionRecognition/src/Audio/FineTuningWav2Vec/Wav2VecEval.py
--data MMEmotionRecognition/data/models/wav2Vec_top_models/FineTuning/data/20211020_094500
--fold 0
--trained_model MMEmotionRecognition/data/models/wav2Vec_top_models/FineTuning/trained_models/wav2vec2-xlsr-ravdess-speech-emotion-recognition/20211020_094500
--out_dir <RAVDESS_dir>/FineTuningWav2Vec2_posteriors
--model_id jonatasgrosman/wav2vec2-large-xlsr-53-english
Note that the evaluation reports metrics per fold, so to obtain the final average accuracy you need to run the previous command five times, setting the fold to 0, 1, 2, 3, and 4. After that, run the following script to obtain the final accuracy (printed to the console):
python3 MMEmotionRecognition/src/Audio/FineTuningWav2Vec/FinalEvaluation.py
--dataPosteriors MMEmotionRecognition/data/models/wav2Vec_top_models/FineTuning/posteriors/20211020_094500
--trained_model MMEmotionRecognition/data/models/wav2Vec_top_models/FineTuning/trained_models/wav2vec2-xlsr-ravdess-speech-emotion-recognition/20211020_094500
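The arithmetic behind a cross-validated accuracy can be sketched as follows. This is only an illustration under the assumption that the final figure is the mean of the five per-fold accuracies; FinalEvaluation.py may aggregate differently (for example, by pooling all predictions), and the function names are hypothetical.

```python
def fold_accuracy(true_labels, predicted_labels):
    """Fraction of samples in one fold whose prediction matches the ground truth."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def average_accuracy(folds):
    """Mean of the per-fold accuracies; folds is a list of (true, predicted) pairs."""
    accuracies = [fold_accuracy(true, pred) for true, pred in folds]
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    # Toy example with two folds of two samples each
    toy_folds = [([0, 1], [0, 1]), ([0, 1], [0, 0])]
    print(f"Average accuracy: {average_accuracy(toy_folds):.2%}")
```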
To extract the features, we first need to run the fine-tuning section to generate the train.csv and test.csv files. After that, we can extract the features from the generated files by running the following command:
python3 MMEmotionRecognition/src/Audio/FeatureExtractionWav2Vec/FeatureExtractor.py
--data MMEmotionRecognition/data/models/wav2Vec_top_models/FineTuning/data/20211020_094500
--model_id jonatasgrosman/wav2vec2-large-xlsr-53-english
--out_dir <RAVDESS_dir>/FineTuningWav2Vec2_embs512
python3 MMEmotionRecognition/src/Audio/FeatureExtractionWav2Vec/FeatureTraining.py
--embs_dir <RAVDESS_dir>/embs512
--model_number 11
--param (80)
--type_of_norm 2
--out_dir MMEmotionRecognition/data/models/avg_MLP80_Audio
To extract the Action Units (AUs) using the OpenFace library, we run:
python3 MMEmotionRecognition/src/Video/OpenFace/AUsFeatureExtractor.py
--videos_dir <RAVDESS_dir>/videos
--openFace_path <OpenFace_dir>
--out_dir <RAVDESS_dir>/Extracted_AUs
--out_dir_processed <RAVDESS_dir>/processed_AUs
Once we extract the embeddings, we can train the visual models:
To train and evaluate the static models, run the command below. Notice that this command will also save the posteriors generated by the trained models in the path passed in out_dir.
python3 MMEmotionRecognition/src/Video/models/staticModels/FeatureTrainingAUs.py
--AUs_dir <RAVDESS_dir>/processed_AUs
--model_number 11
--param (80)
--type_of_norm 1
--out_dir MMEmotionRecognition/data/models/avg_MLP80_AUs/posteriors
See README in MMEmotionRecognition/src/Video/models/sequenceLearning/README_AUS.md
python3 MMEmotionRecognition/src/Fusion/FusionTraining.py
--embs_dir_wav2vec <RAVDESS_dir>/FineTuningWav2Vec2_posteriors/20211020_094500
--embs_dir_biLSTM <RAVDESS_dir>/FUSION/wav2Vec_AUs/BiLSTM_AUS/posteriors
--embs_dir_MLP MMEmotionRecognition/data/posteriors/avg_MLP80_AUs/posteriors
--out_dir <RAVDESS_dir>/FUSION/posteriors
--model_number 2
--param 1.0
--type_of_norm 1
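The three --embs_dir_* options indicate that the fusion stage consumes the posteriors of three models (Wav2Vec2.0, bi-LSTM and MLP). One common way to combine such outputs is late fusion by concatenation: the three 8-class posterior vectors of a sample are joined into a single 24-value vector that a final classifier is trained on. Whether FusionTraining.py does exactly this is an assumption; the sketch below only illustrates the idea, and its function name is hypothetical.

```python
NUM_CLASSES = 8  # the eight emotion classes used by the trained models


def concatenate_posteriors(wav2vec_post, bilstm_post, mlp_post):
    """Join the per-model posteriors of one sample into a single fusion feature vector."""
    named = (("wav2vec", wav2vec_post), ("biLSTM", bilstm_post), ("MLP", mlp_post))
    for name, posterior in named:
        if len(posterior) != NUM_CLASSES:
            raise ValueError(f"{name} posterior has {len(posterior)} values, expected {NUM_CLASSES}")
    return list(wav2vec_post) + list(bilstm_post) + list(mlp_post)


if __name__ == "__main__":
    uniform = [1 / NUM_CLASSES] * NUM_CLASSES
    one_hot = [1.0] + [0.0] * (NUM_CLASSES - 1)
    fused = concatenate_posteriors(one_hot, uniform, one_hot)
    print(len(fused))  # 3 models x 8 classes
```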
To replicate our results, run:
python3 MMEmotionRecognition/src/Fusion/FusionTraining.py
--embs_dir_wav2vec MMEmotionRecognition/data/posteriors/wav2Vec/posteriors/20211020_094500
--embs_dir_biLSTM MMEmotionRecognition/data/posteriors/AUs_biLSTM_6213/posteriorsv2
--embs_dir_MLP MMEmotionRecognition/data/posteriors/avg_MLP80_AUs/posteriors
--out_dir ''
--model_number 2
--param 1.0
--type_of_norm 1
Top model Avg. Accuracy: 86.70%
To download the weights of the trained models (only Wav2Vec2.0 and bi-LSTM), click on this link (~16GB):
https://drive.upm.es/s/AYULcdl44m2Tj8C
In total, there should be 1,440 videos. Check MMEmotionRecognition/data/ravdess_videos.csv for the complete list of the videos used.
The trained models use the following dictionary to make their predictions: {'Angry': 0, 'Calm': 1, 'Disgust': 2, 'Fear': 3, 'Happy': 4, 'Neutral': 5, 'Sad': 6, 'Surprise': 7}. For example, if we feed in a sample whose ground truth is 'Angry', the ideal output would be something like: [1,0,0,0,0,0,0,0]
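As a quick sanity check on a downloaded copy, speech clips can be told apart from songs by the file name. The sketch below assumes the standard RAVDESS naming scheme, in which a name consists of seven two-digit fields separated by hyphens, the second field is the vocal channel (01 = speech, 02 = song) and the seventh is the actor. The function names are hypothetical, and ravdess_videos.csv remains the authoritative list.

```python
from pathlib import Path


def is_speech_file(filename):
    """True if a RAVDESS-style name has seven fields and vocal channel '01' (speech)."""
    fields = Path(filename).stem.split("-")
    return len(fields) == 7 and fields[1] == "01"


def actor_id(filename):
    """Return the actor number encoded in the seventh field of the file name."""
    return int(Path(filename).stem.split("-")[6])


if __name__ == "__main__":
    names = ["01-01-03-02-01-01-12.mp4", "01-02-03-02-01-01-12.mp4"]
    speech_only = [name for name in names if is_speech_file(name)]
    print(speech_only, [actor_id(name) for name in speech_only])
```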
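To turn a posterior vector back into an emotion name, the dictionary can be inverted and indexed with the position of the largest value. This is a minimal sketch; only the dictionary itself comes from the models, and the helper name decode_prediction is hypothetical.

```python
LABEL_TO_ID = {
    "Angry": 0, "Calm": 1, "Disgust": 2, "Fear": 3,
    "Happy": 4, "Neutral": 5, "Sad": 6, "Surprise": 7,
}
ID_TO_LABEL = {index: label for label, index in LABEL_TO_ID.items()}


def decode_prediction(posteriors):
    """Return the emotion name whose posterior value is the largest."""
    best_index = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return ID_TO_LABEL[best_index]


if __name__ == "__main__":
    print(decode_prediction([1, 0, 0, 0, 0, 0, 0, 0]))
```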
MIT License
If you use the code of this work or the generated models, please cite the following paper:
IEEE format:
C. Luna-Jiménez, R. Kleinlein, D. Griol, Z. Callejas, J. M. Montero, and F. Fernández-Martínez, “A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset,” Applied Sciences, vol. 12, no. 1, p. 327, Dec. 2021.
(You can find more citation formats in the MDPI page)
If you have any questions or find a bug in the code, please contact us at:
We would like to thank m3hrdadfi for his open tutorial on Speech Emotion Recognition (Wav2Vec 2.0), which we used as the basis for training our speech emotion recognition models on the RAVDESS dataset.