ando-hub / MSA_Pretrain

Official pytorch implementation of "On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis"

Problems with audio features and evaluation in emotion recognition #4

Closed wangzhu3 closed 4 months ago

wangzhu3 commented 1 year ago

I think there may be two problems. First, when I extract WavLM audio features by running setup_cmumosei.sh, I get a 25-layer result, which seems to include the model's input embedding. However, the other modalities' features, such as BERT and CLIP, have 24-layer results that do not include the input embedding. I therefore think you should remove the first layer of the WavLM output (I think inserting y = y[:, 1:, :, :] at L.85 in extract_audio_embedding.py would be sufficient).
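For reference, here is a minimal sketch of where the extra layer comes from. It is not the repository's extraction script; it assumes the HuggingFace transformers checkpoints microsoft/wavlm-large and bert-large-uncased rather than whatever loaders setup_cmumosei.sh actually uses:

```python
# Minimal sketch, NOT the repository's extraction code: it assumes the HuggingFace
# transformers checkpoints "microsoft/wavlm-large" and "bert-large-uncased" purely
# to illustrate where the extra (25th) layer comes from.
import torch
from transformers import WavLMModel, BertModel, BertTokenizer

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
bert = BertModel.from_pretrained("bert-large-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

with torch.no_grad():
    # WavLM: hidden_states has num_hidden_layers + 1 entries; the first entry is
    # the (pre-transformer) input embedding, giving 25 tensors for the large model.
    audio = torch.randn(1, 16000)                      # 1 s of dummy 16 kHz audio
    a_out = wavlm(audio, output_hidden_states=True)
    audio_layers = torch.stack(a_out.hidden_states)    # (25, B, T, D)

    # BERT follows the same convention: 24 transformer layers + 1 embedding output.
    tokens = tokenizer("a dummy utterance", return_tensors="pt")
    t_out = bert(**tokens, output_hidden_states=True)
    text_layers = torch.stack(t_out.hidden_states)     # (25, B, T, D)

print(audio_layers.shape[0], text_layers.shape[0])     # 25 25

# Dropping the input-embedding output keeps only the 24 transformer layers, which
# is the effect of the suggested y = y[:, 1:, :, :] fix (there the layer axis is
# dim 1; here it is dim 0 because the layers were stacked first).
audio_layers = audio_layers[1:]
```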

Second, you enable us to run not only sentiment analysis but also 6-label multi-label emotion recognition. According to eval_performance.py, you use weighted accuracy and unweighted accuracy to evaluate the model when the mode is 6-label multi-label emotion recognition. I think unweighted accuracy represents the normal average, and weighted accuracy and balanced accuracy mean the same thing; is that correct? If I understand them correctly, you have probably defined UA and WA interchangeably at L.24 and L.25 of eval_performance.py. If I have misunderstood weighted accuracy, please tell me what it is.

ando-hub commented 1 year ago

> First, when I extract WavLM audio features by running setup_cmumosei.sh, I get a 25-layer result, which seems to include the model's input embedding. However, the other modalities' features, such as BERT and CLIP, have 24-layer results that do not include the input embedding. I therefore think you should remove the first layer of the WavLM output (I think inserting y = y[:, 1:, :, :] at L.85 in extract_audio_embedding.py would be sufficient).

Thank you for pointing this out. I confirmed the code and WavLM libs and fixed extract_audio_embedding.py.

> I think unweighted accuracy represents the normal average, and weighted accuracy and balanced accuracy mean the same thing; is that correct?

Unweighted accuracy is the macro average of recall (over classes 0 and 1 for each emotion label), and weighted accuracy is the common overall accuracy. See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report.
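To make the distinction concrete, here is a small toy example (hypothetical labels, not eval_performance.py itself) showing how the two metrics diverge on an imbalanced binary label, as for any single emotion of the 6-label task:

```python
# Toy illustration (made-up labels, not the repository's eval_performance.py):
# "weighted accuracy" as plain accuracy over all samples vs. "unweighted accuracy"
# as the macro average of per-class recall, for one binary emotion label.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)   # imbalanced label: 90 negatives, 10 positives
y_pred = np.array([0] * 95 + [1] * 5)    # classifier biased toward the majority class

wa = accuracy_score(y_true, y_pred)                 # overall accuracy      -> 0.95
ua = recall_score(y_true, y_pred, average="macro")  # macro-averaged recall -> 0.75

print(f"WA (overall accuracy): {wa:.2f}")
print(f"UA (macro recall):     {ua:.2f}")
print(f"balanced_accuracy_score equals UA: {balanced_accuracy_score(y_true, y_pred):.2f}")
```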

wangzhu3 commented 1 year ago

Thank you for your answer.

> Unweighted accuracy is the macro average of recall (over classes 0 and 1 for each emotion label), and weighted accuracy is the common overall accuracy.

In the paper that introduced the CMU-MOSEI dataset (Zadeh et al. [2018], https://aclanthology.org/P18-1208/), they used weighted accuracy, and the definition comes from Tong et al. (https://aclanthology.org/P17-1142.pdf). In that paper, weighted accuracy is defined as WA = (TP * N/P + TN) / (2N), where TP is the number of true positives, TN the number of true negatives, and P and N the total numbers of positive and negative examples; this weighted accuracy is sometimes called balanced accuracy. On the other hand, the common accuracy is defined as Acc = (TP + TN) / (P + N). Thus I think weighted accuracy does not mean the common accuracy.
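As a quick numeric sanity check (with made-up confusion-matrix counts, not CMU-MOSEI numbers), the Tong et al. formula coincides with balanced accuracy but not with the common accuracy:

```python
# Quick numeric check with hypothetical confusion-matrix counts: the Tong et al.
# weighted accuracy equals balanced accuracy and differs from the common accuracy.
TP, FN = 5, 5        # P = TP + FN = 10 positive examples
TN, FP = 90, 0       # N = TN + FP = 90 negative examples
P, N = TP + FN, TN + FP

wa_tong   = (TP * N / P + TN) / (2 * N)   # WA = (TP*N/P + TN) / (2N)
bal_acc   = (TP / P + TN / N) / 2         # balanced accuracy = (TP/P + TN/N) / 2
plain_acc = (TP + TN) / (P + N)           # Acc = (TP + TN) / (P + N)

print(wa_tong, bal_acc, plain_acc)        # 0.75 0.75 0.95
```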

What definition of weighted accuracy do you refer to?

ando-hub commented 10 months ago

I'm really sorry for my very late reply. The metrics of unweighted/weighted accuracy have been widely used in speech emotion recognition, as mentioned in [1][2]. The definitions are described in [3] as follows:

> We use two measures to evaluate the performance: weighted accuracy and unweighted accuracy. Weighted accuracy is the classification accuracy on the whole test set, and unweighted accuracy is an average of the recall for each emotion class, which better reflects overall accuracy in the presence of imbalanced class.

Please note that these metrics are common in multi-class emotion classification but not in multi-label binary classification. It might not be suitable to use them for CMU-MOSEI's 6-label binary emotion classification.

[1] https://arxiv.org/pdf/2102.01813.pdf
[2] https://arxiv.org/pdf/2110.06309.pdf
[3] https://www.isca-speech.org/archive/pdfs/interspeech_2014/han14_interspeech.pdf