MultimodalAffectiveComputing / FV2ES

A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition

Queries on the Repo #10

Open eswar159 opened 7 months ago

eswar159 commented 7 months ago

First of all, I'm new to working with videos,

so I went through your repo and felt really interested in working on this :)

I have a few questions (which might be very basic, but I'm really confused about this part):

1. First I downloaded the full data file (dataa.zip). I see you used two datasets; I'm mainly planning to work on CMU-MOSEI, so I'm concentrating on it. It has three main folders:
   i) MOSEI_SPLIT: this folder has three txt files, which contain just the file names / IDs (which is 100% clear).
   ii) MOSEI_HCF_FEATURES: this also has three files, which are pkl files. I was able to extract them, and they contain (vision, audio, text, label and id):
      - id: just the name of that video / file.
      - label: everywhere I look I see only 6 labels, but when I count the unique entries in this field I get Train: 27, Test: 23, Valid: 25 unique values (and what are the corresponding numbers?). My major confusion is how 6 labels are mapped to this many.
      - text: I see each string has been turned into a 50 x 300 array; exactly which preprocessing method has been used here?
      - audio: I see each audio clip has been turned into a 500 x 74 array; exactly which preprocessing method has been used here?
      - vision: I see each image has been turned into a 500 x 35 array; exactly which preprocessing method has been used here? Also, regarding the vision part, each short video has multiple images; how is this handled / which images are used exactly?
   iii) MOSEI_RAW_PROCESSED: this folder is clear, as it just has the videos and short videos, broken into audio files and images.

My main question here is: if I want to replicate work just like yours, can I use only the MOSEI_HCF_FEATURES folder (as it already has train/test/validation splits)?
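For reference, a minimal sketch of how the contents of one of the pkl files could be inspected, assuming each file stores a Python dict; the key names ("vision", "audio", "text", "label"/"labels") are guesses from the discussion in this thread and may differ from the actual files:

```python
# Hypothetical inspection of a MOSEI_HCF_FEATURES pkl file; key names are assumptions.
import pickle
import numpy as np

with open("mosei_senti_hcf_valid.pkl", "rb") as f:
    data = pickle.load(f)

# Show the top-level keys and the shape of each modality array.
print(list(data.keys()))
for key in ("vision", "audio", "text"):
    if key in data:
        print(key, np.asarray(data[key]).shape)  # e.g. (N, 500, 35), (N, 500, 74), (N, 50, 300)

# The label field may be named "label" or "labels" depending on the file.
label_key = "labels" if "labels" in data else "label"
labels = np.asarray(data[label_key])
print(label_key, labels.shape)                          # e.g. (1871, 1, 1) for the valid split
print("unique label values:", np.unique(labels).size)   # continuous scores, hence more than 6
```

If the labels are continuous emotional scores (as described in the maintainer's answer below), the 23-27 unique values per split would simply be the distinct score values rather than class indices.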

ICMiF commented 3 months ago

> *(quotes the original question above)*

Hello, I would like to train this model on my own dataset, but I am not sure about the details of the dataset format. Could you provide a small example?

MultimodalAffectiveComputing commented 3 months ago

> *(quotes the original question above)*

First of all, thank you very much for your interest in our study. For your five questions, we have the following answers:

  1. The content of the label. The label in the pkl files corresponds to one value per video clip. For example, the label dimension of the ‘mosei_senti_hcf_valid.pkl’ file is [1871, 1, 1], i.e. an emotional score for each of the 1871 validation videos. We therefore recommend that you check again to see whether there are any loading issues.
  2. Methods for dividing and processing text, audio and images. The processing of the three modalities is mainly simple Python processing. You can find the detailed code in ‘datasets.py’; the file path is ‘FV2ES/V2EM_prediction/src/datasets.py’.
  3. Picture selection. All the images generated by preprocessing each short video are fed into the model; no specific frame is selected.
  4. Use of MOSEI_HCF_FEATURES. You can directly use MOSEI_HCF_FEATURES as the input of the model. The MOSEI_HCF_FEATURES files are the preprocessed video data, but you need to adjust the specific model accordingly when using them (see the sketch after this list).
  5. Simple example of a dataset. In response to your question, we have uploaded a case from the IEMOCAP dataset; please see the repository for details. Thank you for your question.
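For illustration only, here is a minimal sketch (not the repository's actual loader in ‘datasets.py’) of how the pre-extracted HCF features could be wrapped in a PyTorch Dataset and fed to a model; the key names and the file name are assumptions based on the thread above:

```python
# Minimal sketch, NOT the repo's actual datasets.py: wrap a pre-extracted
# MOSEI_HCF_FEATURES pkl in a PyTorch Dataset. Key/file names are assumptions.
import pickle
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MoseiHCFDataset(Dataset):
    def __init__(self, pkl_path):
        with open(pkl_path, "rb") as f:
            data = pickle.load(f)
        # Expected shapes based on the discussion above: vision (N, 500, 35),
        # audio (N, 500, 74), text (N, 50, 300), labels (N, 1, 1).
        self.vision = torch.as_tensor(np.asarray(data["vision"]), dtype=torch.float32)
        self.audio = torch.as_tensor(np.asarray(data["audio"]), dtype=torch.float32)
        self.text = torch.as_tensor(np.asarray(data["text"]), dtype=torch.float32)
        label_key = "labels" if "labels" in data else "label"
        self.labels = torch.as_tensor(np.asarray(data[label_key]), dtype=torch.float32)

    def __len__(self):
        return self.labels.shape[0]

    def __getitem__(self, idx):
        # Return one clip's three modalities plus its scalar emotional score.
        return self.vision[idx], self.audio[idx], self.text[idx], self.labels[idx].squeeze()

# For the valid split this should report 1871 samples (matching the [1871, 1, 1] labels).
dataset = MoseiHCFDataset("mosei_senti_hcf_valid.pkl")
print(len(dataset), dataset.labels.shape)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```

The batch size and any reshaping of the modalities would still need to match whatever the specific model expects, as noted in answer 4.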