LAION-AI / audio-dataset

Audio Dataset for training CLAP and other models

emotion recognition webdata #102

Open jiachengc opened 2 months ago

jiachengc commented 2 months ago

Thanks for this solid work. Have you released any preprocessed emotion recognition web datasets, such as RAVDESS or CREMA-D, or any data processing scripts so we can process the data ourselves? @knoriy @YuchenHui22314

YuchenHui22314 commented 2 months ago

Thanks for your comment, but as far as I know, we did not use emotion recognition datasets in the end. Good luck with your research!

jiachengc commented 1 month ago

> Thanks for your comment, but as far as I know, we did not use emotion recognition datasets in the end. Good luck with your research!

Thank you for your quick reply. I would like to ask a quick question: if my dataset is an audio emotion recognition dataset such as TESS, then when I build the corresponding webdataset, should I rewrite the 'text' field to hold the emotion label of the audio instead of a caption? For example:

```json
{
  "text": ["happy"],
  "tag": ["happy"],
  "original_data": {
    "title": "TESS - Toronto Emotional Speech Set",
    "description": "Dataset for emotion recognition from audio",
    "license": "TESS dataset license",
    "fname": "OAF_back_happy.flac",
    "category": "happy"
  }
}
```

The goal is to let the model output the corresponding emotion predictions. Looking forward to your reply. Thanks in advance! @YuchenHui22314
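In case it helps to be concrete, this is roughly how I would pack such a sample into a WebDataset shard (just a sketch using the webdataset library's TarWriter; the shard name, sample key, and file names are my own placeholders, not from this repo's preprocessing scripts):

```python
import json
import webdataset as wds  # pip install webdataset

# Metadata for one TESS clip, using the emotion label as the "text" field (my assumption).
sample_meta = {
    "text": ["happy"],
    "tag": ["happy"],
    "original_data": {
        "title": "TESS - Toronto Emotional Speech Set",
        "description": "Dataset for emotion recognition from audio",
        "license": "TESS dataset license",
        "fname": "OAF_back_happy.flac",
        "category": "happy",
    },
}

with wds.TarWriter("tess-000000.tar") as sink:           # one output shard
    with open("OAF_back_happy.flac", "rb") as f:
        audio_bytes = f.read()
    sink.write({
        "__key__": "OAF_back_happy",                      # shared key -> <key>.flac and <key>.json
        "flac": audio_bytes,                              # raw audio bytes, written as-is
        "json": json.dumps(sample_meta).encode("utf-8"),  # metadata serialized next to the audio
    })
```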

YuchenHui22314 commented 1 month ago

Then the "text" should be a complete sentence, e.g. ["this is a happy sound"]. So you may want to come up with a way to make a sentence out of the emotion labels.
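For example, something as simple as filling the label into a caption template would do (a rough sketch; the template wording is just an example, not what we used):

```python
import random

# A few caption templates; the exact wording is arbitrary, any natural sentence works.
EMOTION_TEMPLATES = [
    "this is a {} sound",
    "a person speaking in a {} tone of voice",
]

def label_to_caption(label: str) -> str:
    """Turn an emotion label like 'happy' into a caption sentence for the 'text' field."""
    return random.choice(EMOTION_TEMPLATES).format(label.lower())

print(label_to_caption("happy"))  # e.g. "this is a happy sound"
```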

jiachengc commented 1 month ago

> Then the "text" should be a complete sentence, e.g. ["this is a happy sound"]. So you may want to come up with a way to make a sentence out of the emotion labels.

Thank you for your quick reply, I really appreciate it. I followed your suggestion and changed the text to captions like ['this is a happy sound'], and then used eval_linear_probe.py to fine-tune the last linear layer on top of the audio encoder. However, the results on the IEMOCAP dataset are quite poor, with an accuracy of around 55%. So far I have tried a range of learning rates [1e-2, 1e-3, 1e-4, 1e-5], weight decay values [0.1, 0.01, 0.001, 0.001], and linear probe losses [ce, mse], but none of these combinations pushed the accuracy beyond 55%. I'm feeling a bit lost about the next debugging direction and would greatly appreciate any suggestions you might have. Thanks in advance again! @YuchenHui22314
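For context, my understanding of the linear probe setup is roughly the following (a simplified PyTorch sketch, not the actual eval_linear_probe.py code; the embedding size and number of classes are my assumptions):

```python
import torch
import torch.nn as nn

embed_dim = 512       # assumed CLAP audio embedding size
num_emotions = 4      # e.g. the common 4-class IEMOCAP setup

# Only this linear layer is trained; the audio encoder stays frozen and just
# produces the embeddings fed into it.
probe = nn.Linear(embed_dim, num_emotions)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4, weight_decay=0.01)

def train_step(audio_embeddings: torch.Tensor, labels: torch.Tensor) -> float:
    """One update of the probe on a batch of precomputed (frozen) audio embeddings."""
    logits = probe(audio_embeddings)   # (batch, num_emotions)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```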

YuchenHui22314 commented 1 month ago

Oh ok. So you are doing supervised classification instead of contrastive pretraining; I thought you wanted to add an emotion dataset as part of the pretraining data. The "text" field should be a sentence only during pretraining, but when it comes to supervised classification, I am not familiar with the eval_linear_probe.py code. Maybe you could reach out to Ke Chen about this!