GSSOC '23: Speech Emotion Recognition using Deep Learning Network Flow

From a machine learning perspective, speech emotion recognition is a classification problem where an input sample (audio) needs to be classified into a few predefined emotions. Of course, the challenge in this problem goes beyond technical – how does one even define emotion and consistently decide the class given an audio sample that can be ambiguous to even humans?

The issue is more pressing for dataset creators, but it also becomes essential while evaluating a trained model. Further below, we will see that our dataset contains two similar-sounding emotions, “calm” and “neutral,” which can be tricky for even humans to ascertain in ambiguous cases. Meanwhile, “angry” and “happy” have prominent differences that the model can quickly learn.

So, it is clear that machine learning models need to delve deeper into the feature extraction and non-linearity of the audio signals to effectively capture the nuanced differences in speech that humans can detect intuitively. Currently, researchers work with audio signals by treating them either as time-series data or using spectrograms to generate numeric and image forms of the audio. All these techniques involve some or the other kind of transformation to the original data, thus making feature loss likely. There is still a need to make machine learning models robust at learning features from audio data – robustness in classification or generation tasks will follow.

The-Data-Alchemists-Manipal / MindWave

GSSOC '23: Speech Emotion Recognition using Deep Learning Network Flow #565