Filename identifiers (see the parsing sketch after this list):
- Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
- Vocal channel (01 = speech, 02 = song).
- Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
- Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
- Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
- Repetition (01 = 1st repetition, 02 = 2nd repetition).
- Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
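These identifiers can be read straight from each dash-separated filename. Below is a minimal parsing sketch; the example filename `03-01-06-01-02-01-12.wav` and the helper name `parse_ravdess_filename` are illustrative, but the field order follows the list above.

```python
# Minimal sketch: split a RAVDESS filename into its identifier fields.
from pathlib import Path

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_filename(path):
    """Parse e.g. '03-01-06-01-02-01-12.wav' into a dict of identifiers."""
    modality, vocal_channel, emotion, intensity, statement, repetition, actor = Path(path).stem.split("-")
    return {
        "modality": modality,
        "vocal_channel": vocal_channel,
        "emotion": EMOTIONS[emotion],
        "intensity": intensity,
        "statement": statement,
        "repetition": repetition,
        "actor": int(actor),
        "actor_sex": "male" if int(actor) % 2 == 1 else "female",
    }

print(parse_ravdess_filename("03-01-06-01-02-01-12.wav"))
```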
Developed under https://github.com/aiden200/2D3MF/tree/user/steve/resnet
The task is to pre-train a ResNet model on a simple emotion classification task. The input to the ResNet should be MFCCs computed from 1 second of audio with a sampling rate of `sr=44100` Hz and `n_mfcc=10`, i.e. you can use `librosa.feature.mfcc(y=y_1sec_audio, sr=44100, n_mfcc=10)`. The network can be trained with audio clips from the RAVDESS dataset. The model should predict 8 labels: 01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised. A minimal training sketch follows below.
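A minimal sketch of what this pre-training setup could look like, assuming PyTorch/torchvision for the ResNet. The single-channel `conv1` replacement, the `mfcc_features` and `train_step` helpers, and the optimizer settings are illustrative assumptions, not prescribed by this issue.

```python
# Sketch of MFCC extraction + ResNet-18 pre-training on 8 RAVDESS emotion classes.
import librosa
import torch
import torch.nn as nn
from torchvision.models import resnet18

def mfcc_features(wav_path, sr=44100, n_mfcc=10):
    """Load 1 second of audio and compute its 10-coefficient MFCC matrix."""
    y, _ = librosa.load(wav_path, sr=sr, duration=1.0)
    y = librosa.util.fix_length(y, size=sr)              # pad/trim to exactly 1 s
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return torch.from_numpy(mfcc).float().unsqueeze(0)   # (1, n_mfcc, frames)

# ResNet-18 adapted to single-channel MFCC "images" and 8 emotion classes.
model = resnet18(num_classes=8)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(batch_paths, batch_labels):
    """One illustrative optimization step over wav paths and integer labels 0-7."""
    x = torch.stack([mfcc_features(p) for p in batch_paths])  # (B, 1, 10, frames)
    y = torch.tensor(batch_labels)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```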
This pre-trained network will then be used as a feature extractor within our 2D3MF pipeline.
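One way to reuse the pre-trained network as a feature extractor is to drop the classification head and freeze the weights, as sketched below. The checkpoint filename is hypothetical, and the 512-dimensional embedding size simply follows from ResNet-18's final pooling layer.

```python
# Sketch: frozen feature extractor from the pre-trained emotion ResNet.
import torch
import torch.nn as nn
from torchvision.models import resnet18

extractor = resnet18(num_classes=8)
extractor.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
extractor.load_state_dict(torch.load("resnet_ravdess_pretrained.pt"))  # hypothetical path
extractor.fc = nn.Identity()          # drop the 8-way head, keep 512-d embeddings
extractor.eval()
for p in extractor.parameters():
    p.requires_grad = False           # frozen when used inside the 2D3MF pipeline

with torch.no_grad():
    mfcc = torch.randn(1, 1, 10, 87)  # (batch, channel, n_mfcc, frames) placeholder
    embedding = extractor(mfcc)       # -> torch.Size([1, 512])
```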