This repo contains the code for the paper:
Speech Emotion Recognition with Multi-task Learning, X. Cai et al., INTERSPEECH 2021
The code is based on https://github.com/huggingface/transformers/tree/master/examples/research_projects/wav2vec2.
Install the dependencies:
pip install -r requirements.txt
You might also need to install libsndfile:
sudo apt-get install libsndfile1-dev
Or refer to https://github.com/libsndfile/libsndfile.
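To check that libsndfile is visible to Python, you can print its version through the soundfile package (assuming soundfile was installed via requirements.txt):
python -c "import soundfile; print(soundfile.__libsndfile_version__)"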
Point the csv files at your local copy of the IEMOCAP wav files (replace /wav_path with the actual directory on your machine):
for f in iemocap/*.csv; do sed -i 's|/path_to_wavs|/wav_path|' "$f"; done
Note: The iemocap/ directory contains 20 csv files, corresponding to a 10-fold split of the data by session ID (01F, 01M, ..., 05F, 05M). For each fold, the selected session is used for testing and the other 9 sessions for training; for example, for the fold 01F, 01F is the test set and the remaining 9 sessions form the training set. Each fold has two csv files, one for training and one for testing, named e.g. iemocap_01F.train.csv and iemocap_01F.test.csv. Each csv file has 3 columns: file, emotion, text. The column 'file' gives the path to the wav file; the column 'emotion' is the emotion label (we use 4 labels: e0, e1, e2, e3); the column 'text' is the transcript. For example:
file,emotion,text
/path/to/Ses01F_impro01_F000.wav,e0,"EXCUSE ME ."
/path/to/Ses01F_impro01_F001.wav,e0,"YEAH ."
...
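After rewriting the paths, you can quickly verify that every wav file referenced in a csv actually exists. A minimal Python sketch, assuming the column layout above (the csv file name is just an example):
import csv, os

with open("iemocap/iemocap_01F.train.csv") as f:
    for row in csv.DictReader(f):
        # Fail fast on the first missing wav file.
        assert os.path.isfile(row["file"]), f"missing wav: {row['file']}"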
To train, run:
bash run.sh
This runs training and writes results to the output/tmp/ folder, while cache files are stored in cache/. The default configuration is: model = wav2vec2-base, alpha = 0.1 (the weight balancing the auxiliary ASR loss against the emotion classification loss in the multi-task objective; see the paper for the exact formulation), LR = 5e-5, effective batch size = 8, total train epochs = 200. The 01F split is used for testing and the remaining sessions for training.
NOTE: At around 40 epochs, the eval accuracy should already reach a fairly good level. However, because the learning rate is still high at that point, the accuracy will fluctuate. Nevertheless, validating checkpoints or early stopping here can save time and still give a reasonably good model.
WARNING: On a single GPU, 200 epochs will take days to finish. To speed up training, consider using multiple GPUs; by default, the code uses all GPUs in the system.
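To restrict training to particular GPUs, set the standard CUDA_VISIBLE_DEVICES environment variable, e.g. to use only the first two GPUs:
CUDA_VISIBLE_DEVICES=0,1 bash run.sh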
After training, you can run the inference code, using the model saved in output/tmp (or pass another path containing a saved model):
bash prediction.sh output/tmp
This generates classification results for the 01F test split in output/predictions/tmp. Details can be found in the script. NOTE: If you want to run inference on your own data (prepared as a csv file), modify the load_dataset() part in run_emotion.py, as sketched below.
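A minimal sketch of such a modification, using the HuggingFace datasets library (the csv path is a placeholder; the csv is assumed to have the same file, emotion, text columns as the IEMOCAP ones):
from datasets import load_dataset

# Load a custom csv split instead of the IEMOCAP test csv.
dataset = load_dataset("csv", data_files={"test": "/path/to/my_data.csv"})["test"]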
To reproduce the paper's results using the checkpoints:
for s in 01F 01M 02F 02M 03F 03M 04F 04M 05F 05M; do bash reproduce.sh ckpts/$s/ $s > output/$s.eval; done
This will generate evaluation results using the downloaded checkpoints. The evaluation results are saved in the output/ folder, named 01F.eval, 01M.eval, etc. The last line in each *.eval file gives the number of correct predictions and the accuracy for that fold (10 folds in total). If you sum the correct counts over all folds and divide by 5531 (the total number of utterances), you should get an accuracy slightly above 0.78.
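A one-liner sketch for that summation (hypothetical: it assumes the correct count is the second whitespace-separated field of the last line of each *.eval file; adjust the awk field to match the actual format):
for f in output/*.eval; do tail -n 1 "$f"; done | awk '{s += $2} END {printf "overall acc = %.4f\n", s / 5531}'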
Key parameters:
Parameters not recommended: