capstone496 / SpeechSentiments


[Progress Report] Research on non-image classification methods (SVM, KNN, HMM) #7

Open rightnknow opened 5 years ago

rightnknow commented 5 years ago

Currently we are trying to verify the architecture described in Ooi, C., Seng, K., Ang, L. and Chew, L. (2014), "A new approach of audio emotion recognition". In this paper they propose a way to classify 6 emotions. First, they identify the major features that can be extracted from the audio: zero-crossing rate, log energy, the Teager energy operator, and pitch. According to the chart they provide, they set a specific threshold value for each feature in order to separate the emotions into 3 groups, each containing the 2 emotions with maximum MFCC divergence. Here's the feature chart:

[Figure: feature chart from the paper]
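To make the grouping step concrete, here is a toy sketch of the threshold idea in Python. The threshold values, group contents, and function name are placeholders for illustration, not the values from the paper:

```python
# Toy sketch of threshold-based group assignment. All thresholds and
# group contents below are made-up placeholders, NOT the paper's values.
def assign_group(mean_pitch, mean_zcr, pitch_thresh=200.0, zcr_thresh=0.1):
    """Route an utterance to one of three two-emotion groups."""
    if mean_pitch > pitch_thresh:
        return "group_1"  # e.g. a high-arousal pair such as angry/surprise
    if mean_zcr > zcr_thresh:
        return "group_2"  # e.g. happy/disgust
    return "group_3"      # e.g. fear/sad

# A second-stage MFCC classifier then separates the two emotions in the group.
```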

We tried to verify whether the emotion patterns are similar in our dataset. After cleaning the data and computing the features, we obtained the following charts.

Note that we drop the classes "neutral" and "calm".

[Figures: per-emotion feature charts computed on our dataset]

We can see that the trends are similar; however, the ranking of each feature (ordered from highest to lowest value per emotion) differs somewhat from the chart provided.

Sequence comparison

Pitch Sequence:

From research paper:

  1. Angry
  2. Surprise
  3. Happy
  4. Disgust
  5. Fear
  6. Sad

From our data set:

  1. Angry
  2. Fear
  3. Disgust
  4. Happy
  5. Surprise
  6. Sad

Zero Crossing Rate Sequence:

From research paper:

  1. Sad
  2. Disgust
  3. Surprise
  4. Angry
  5. Happy
  6. Fear

From our data set:

  1. Surprise
  2. Happy
  3. Sad
  4. Fear
  5. Disgust
  6. Angry

Log Energy Sequence:

From research paper:

  1. Sad
  2. Disgust
  3. Surprise
  4. Angry
  5. Happy
  6. Fear

From our data set:

  1. Fear
  2. Happy
  3. Angry
  4. Surprise
  5. Sad
  6. Disgust

Root mean square energy represents the mean of the overall energy, while log energy takes the natural log of the overall energy. The ordering across emotions should therefore be preserved if both calculations use the same overall energy (same data).
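As a quick sanity check of that claim, here is a minimal sketch (using synthetic signals for illustration, not our actual pipeline) showing that RMS energy and log energy produce identical rankings:

```python
# RMS energy sqrt(E/N) and log energy log(E) are both monotone increasing
# in the total energy E, so they rank utterances identically.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
signals = [rng.normal(scale=s, size=16000) for s in (0.1, 0.5, 1.0, 2.0)]

total_energy = np.array([np.sum(x ** 2) for x in signals])
rms_energy = np.sqrt(total_energy / 16000)
log_energy = np.log(total_energy)

# Spearman rank correlation of 1.0 confirms identical ordering.
print(spearmanr(rms_energy, log_energy).correlation)  # -> 1.0
```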

We see that there is a difference between the research paper and our verification. I suspect this is mainly because we are using a different dataset; unfortunately, we could not get access to theirs. However, this exposes an issue with the algorithm: since the rankings vary this much between datasets, the thresholds they set will be dataset-dependent, which suggests it will not be a good generic algorithm. It may perform very well on their dataset, since the train/validate/test data come from the same source (and are therefore correlated), but accuracy will suffer when the input does not resemble that source.

rightnknow commented 5 years ago

This review builds on the previous comment.

Setting aside the fact that these features are data-dependent, let's check whether MFCCs can really separate different groups of emotions.

We can test this by looping through all combinations of emotions (in pairs).

But first, here is a plot of the overall MFCC coefficients for the 8 emotions; we use PCA to reduce them to 2 components and plot the result.

[Figure: 2-component PCA projection of the MFCC coefficients for 8 emotions]
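For reference, a minimal sketch of how such a projection can be produced with librosa and scikit-learn; the `clips` mapping from emotion label to audio paths is an assumed placeholder, not our actual loader:

```python
# Project per-clip mean MFCC vectors to 2-D with PCA and scatter-plot them.
import librosa
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def mean_mfcc(path, n_mfcc=20):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return mfcc.mean(axis=1)  # average over time -> one vector per clip

# `clips` maps emotion label -> list of audio file paths (placeholder).
features = [mean_mfcc(p) for paths in clips.values() for p in paths]
labels = [emo for emo, paths in clips.items() for _ in paths]

X2 = PCA(n_components=2).fit_transform(np.array(features))
for emotion in sorted(set(labels)):
    idx = [i for i, l in enumerate(labels) if l == emotion]
    plt.scatter(X2[idx, 0], X2[idx, 1], label=emotion, s=10)
plt.legend()
plt.show()
```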

After this, we are curious to find the best match, i.e., the pair of emotions that yields the highest classification accuracy.

From the result graph, we can see that happy/sad achieves the highest classification accuracy, at 70%.

[Figure: classification accuracy for each pair of emotions]
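A minimal sketch of how this pairwise sweep could be run, assuming a prepared MFCC feature matrix `X` and label array `y` (both placeholders for our actual pipeline):

```python
# Train a binary SVM for every pair of emotions and record its
# cross-validated accuracy; report the best-separated pair.
from itertools import combinations
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

scores = {}
for a, b in combinations(np.unique(y), 2):
    mask = np.isin(y, [a, b])  # keep only clips from this pair
    acc = cross_val_score(SVC(), X[mask], y[mask], cv=5).mean()
    scores[(a, b)] = acc

best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```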

Here are the MFCC points for happy and sad:

[Figure: MFCC points for happy and sad]

However, this assumes 100% accuracy in assigning groups (that is, deciding which group of emotions the input audio belongs to). Previously, we mentioned that the features are data-dependent, and these are the features we use to classify audio into groups. We may tune the parameters and increase the number of components to push the pairwise accuracy higher, but the overall accuracy will still depend on the group assignment, which will drag the total accuracy down.

A quick example: say the accuracy of classifying an emotion into the correct group is 70% (random guessing would give 33%). Our overall accuracy would then be 0.7 * 0.7 = 49%. Compare this with the paper, where they claim to reach about 70% overall accuracy (implying nearly 100% accuracy in group classification).

For the next step, I'll feed the features into a machine learning algorithm (a neural network or just a simple SVM) to do the classification, which will give a complete prototype of this model and a close estimate of its performance.

rightnknow commented 5 years ago

Here I propose a quick modification to the group-assignment step. In the paper, groups are assigned based on feature thresholds. Instead of determining the threshold values manually, I'll construct a neural network to do the classification based on the input features.

The inputs are limited as of now. Currently we only have three values per input sample that can be fed to this network: mean pitch, mean root-mean-square energy, and mean zero-crossing rate. We definitely need more input dimensions, so I'm looking into music features that can expand the dimensionality. Here's a link that explains tonality: http://www.nyu.edu/classes/bello/MIR_files/tonality.pdf (In this paper, they propose a way of recognizing chords by cross-checking the input against template chords; this may provide some insight.)

Since tone and emotion are related, in the sense that a certain tone corresponds to characteristic frequencies, we can use tonal features to analyze emotion. I'll investigate the following features: spectral contrast and spectral flatness (a sketch of extracting both follows).
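A minimal sketch of extracting those two features with librosa, pooled to clip-level means (the pooling choice and function name are assumptions):

```python
# Extract spectral contrast and spectral flatness for one clip and pool
# them over time, yielding a fixed-length vector to append to our inputs.
import librosa
import numpy as np

def spectral_feature_means(path):
    y, sr = librosa.load(path, sr=None)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # (n_bands + 1, n_frames)
    flatness = librosa.feature.spectral_flatness(y=y)         # (1, n_frames)
    return np.concatenate([contrast.mean(axis=1), flatness.mean(axis=1)])
```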

rightnknow commented 5 years ago

A problem with the above implementation is that it turns the single task 'classify emotion' into two: 'classify which group of emotions' and then 'classify the emotion inside the group'. We should at least try the single-stage problem before we break it down. I constructed a multi-class SVM and a neural network. The features are the weighted MFCC coefficients and the means of the spectral features, taking about 20 MFCC parameters. The result for six-emotion classification: accuracy of the multi-class SVM is 0.24416135881104034; accuracy of the neural network is 0.3184713375796178.

The results are not very encouraging.
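For reference, a rough sketch of how these two baselines can be set up with scikit-learn; `X` and `y` stand for the feature matrix and labels described above (placeholders, not the exact experiment code):

```python
# Six-emotion baselines: an RBF-kernel multi-class SVM (one-vs-one by
# default in scikit-learn) and a small MLP, both on standardized features.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500))
mlp.fit(X_train, y_train)
print("MLP accuracy:", mlp.score(X_test, y_test))
```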