Code for "Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations" (Interspeech 2023). The paper proposes a system that integrates emotion recognition with speech recognition and speaker diarisation in a jointly trained model.
Two metrics, TEER and sTEER, are proposed to evaluate emotion classification performance under automatic segmentation.
Convert stereo audio to single channel
data_prep/single_channel.py
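As an illustration of this step, here is a stdlib-only sketch of downmixing a 16-bit stereo WAV by averaging the two channels (the repository script may handle formats and edge cases differently):

```python
import struct
import wave

def stereo_to_mono(in_path, out_path):
    """Average the two channels of a 16-bit stereo WAV into a mono WAV."""
    with wave.open(in_path, "rb") as wf:
        # expect 16-bit (2-byte) stereo input
        assert wf.getnchannels() == 2 and wf.getsampwidth() == 2
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    # each sample is 2 bytes; interleaved as L, R, L, R, ...
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(struct.pack("<%dh" % len(mono), *mono))
```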
Prepare reference transcriptions
data_prep/iemo_trans_raw.py # generate raw reference transcription from the dataset
data_prep/iemo_trans_organized.py # remove punctuation and special markers
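The cleaning step above might look roughly like this (hypothetical rules for illustration; the actual script's punctuation and marker handling may differ):

```python
import re

def clean_transcript(text):
    """Remove bracketed special markers (e.g. [LAUGHTER]) and punctuation,
    keeping apostrophes, then normalise case and whitespace.
    A hypothetical sketch, not the repository's exact rules."""
    text = re.sub(r"\[[^\]]*\]", " ", text)   # drop [MARKER] annotations
    text = re.sub(r"[^\w\s']", " ", text)     # drop remaining punctuation
    return " ".join(text.upper().split())
```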
Prepare emotion labels
data_prep/iemo_lab_AER-cat.py # 6-way emotion classification label
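For illustration, a label map for a 6-way setup might look like the following; note this mapping is an assumption based on common IEMOCAP configurations, and the actual classes used by the script may differ:

```python
# Hypothetical 6-way mapping; the real class set in iemo_lab_AER-cat.py may differ.
EMO2ID = {"ang": 0, "hap": 1, "exc": 2, "sad": 3, "fru": 4, "neu": 5}

def emo_label(code):
    """Map an IEMOCAP emotion code to a class id; None if outside the 6-way set."""
    return EMO2ID.get(code)
```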
Prepare VAD labels
data_prep/iemo_lab_VAD-utt.py
data_prep/iemo_lab_VAD-seg.py
data_prep/iemo_lab_VAD-annote.py
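The scripts above derive VAD labels at different granularities. As a generic illustration of turning timed speech segments into frame-level 0/1 labels (the frame shift here is an assumption, not necessarily the repository's value):

```python
def vad_frame_labels(speech_segments, total_dur, frame_shift=0.01):
    """Turn (start, end) speech segments in seconds into per-frame 0/1 VAD labels.
    frame_shift=0.01 (10 ms) is an assumed default."""
    n_frames = int(round(total_dur / frame_shift))
    labels = [0] * n_frames
    for start, end in speech_segments:
        lo = max(0, int(round(start / frame_shift)))
        hi = min(n_frames, int(round(end / frame_shift)))
        for i in range(lo, hi):
            labels[i] = 1
    return labels
```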
Prepare training, validation, and testing scp files
data_prep/iemocap_prepare.py
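As a sketch of this step, a leave-one-session-out split and a plain scp writer might look like this (IEMOCAP utterance ids begin with a session tag such as `Ses01F_impro01_F000`; the actual split and scp format used by the script may differ):

```python
def split_by_session(utt2wav, test_session="Ses05"):
    """Hypothetical leave-one-session-out split keyed on the id's session prefix."""
    train = {u: w for u, w in utt2wav.items() if not u.startswith(test_session)}
    test = {u: w for u, w in utt2wav.items() if u.startswith(test_session)}
    return train, test

def write_scp(path, utt2wav):
    # one "utt_id wav_path" pair per line
    with open(path, "w") as f:
        for utt, wav in sorted(utt2wav.items()):
            f.write(f"{utt} {wav}\n")
```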
Train.py Train.yaml --output_folder=exp # train the integrated model
Train.py Train.yaml --FWD_VAD=True --output_folder=exp # forward pass to generate VAD output
scoring/score_VAD.py # score the VAD output
fwd_drz.py fwd_drz.yaml --output_folder=exp-eval # forward pass for speaker diarisation
scoring/score_DER.py # compute the diarisation error rate (DER)
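For intuition about DER scoring, here is a toy frame-level version that brute-forces the reference-to-hypothesis speaker mapping (real DER tools such as the one used here work on timed RTTM segments, with collars; this is only an illustration for small speaker counts):

```python
from itertools import permutations

def frame_der(ref, hyp):
    """Frame-level diarisation error rate.
    ref/hyp: equal-length lists of speaker ids, with None for non-speech.
    Tries every ref->hyp speaker mapping and keeps the best one."""
    ref_spk = sorted({s for s in ref if s is not None})
    hyp_spk = sorted({s for s in hyp if s is not None})
    scored = sum(1 for r in ref if r is not None)  # total reference speech frames
    # pad so every reference speaker can be left unmapped if needed
    padded = hyp_spk + [None] * max(0, len(ref_spk) - len(hyp_spk))
    best = None
    for perm in permutations(padded, len(ref_spk)):
        mapping = dict(zip(ref_spk, perm))
        errors = 0
        for r, h in zip(ref, hyp):
            if r is None and h is not None:
                errors += 1                 # false alarm
            elif r is not None and h is None:
                errors += 1                 # missed speech
            elif r is not None and mapping[r] != h:
                errors += 1                 # speaker confusion
        best = errors if best is None else min(best, errors)
    return best / scored
```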
Train.py Train.yaml --FWD_DRZ=True --output_folder=exp # forward pass using the diarisation output
scoring/score_cpWER.py # compute the speaker-attributed cpWER
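cpWER concatenates each speaker's words and takes the speaker pairing with the fewest word errors. A brute-force sketch (practical only for small speaker counts; the repository's scorer may differ in details such as normalisation):

```python
from itertools import permutations

def wer_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cpwer(ref_by_spk, hyp_by_spk):
    """Concatenated minimum-permutation WER over speaker pairings."""
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    while len(hyps) < len(refs):      # pad with empty hypotheses if needed
        hyps.append([])
    n_ref = sum(len(r) for r in refs)
    best = min(
        sum(wer_errors(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps, len(refs))
    )
    return best / n_ref
```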
prepare_emo_rttm.py # prepare rttm file for (s)TEER evaluation
scoring/score_TEER.py # compute TEER and sTEER
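As a rough, assumed illustration only (the precise TEER and sTEER definitions are given in the paper and implemented in scoring/score_TEER.py), a time-weighted emotion error over matched segments could look like:

```python
def time_weighted_emotion_error(segments):
    """segments: (duration_sec, ref_emotion, hyp_emotion) triples.
    Errors are weighted by segment duration. An assumed illustration of
    time weighting only, not the paper's exact TEER/sTEER computation."""
    total = sum(dur for dur, _, _ in segments)
    wrong = sum(dur for dur, ref, hyp in segments if ref != hyp)
    return wrong / total
```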
N.B. Since PyTorch's CTC loss (torch.nn.functional.ctc_loss) may produce nondeterministic gradients when given tensors on a CUDA device, users may get slightly different results from those reported in the paper.
See https://pytorch.org/docs/1.11/generated/torch.nn.functional.ctc_loss.html for details.
Please cite:
@inproceedings{wu23_interspeech,
author={Wen Wu and Chao Zhang and Philip C. Woodland},
title={{Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={3607--3611},
doi={10.21437/Interspeech.2023-293}
}