donghao51 / SimMMDG

[NeurIPS 2023] SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization

Questions on evaluation metrics #6

Closed lz19991122 closed 1 week ago

lz19991122 commented 2 weeks ago

Dear Dr. Dong,

Hello! I am very interested in your paper and code. However, this is my first time working on action recognition and classification in the video modality, and while studying your paper I ran into a few questions I hope you can answer.

First, the paper describes video files, but what I actually see in the dataset are JPG images. Why is this representation used, and how do these images correspond to the audio timeline?

Second, regarding how video segments are labeled: the P08_01 folder in the D1 domain contains JPG frames spanning many actions. How are the action labels assigned? Is a continuous run of image frames (e.g., frame_0000001.jpg to frame_000000n.jpg) assigned to one action category, or are action categories determined by video timestamps?

Third, how are the audio labels assigned?

Fourth, the paper reports evaluation scores but does not specify the metric used. How are the classification results validated, what is the exact evaluation metric, and is the evaluation carried out per labeled video segment?

If you have time to answer my questions, I would be extremely grateful!

donghao51 commented 2 weeks ago

Hello, thanks for your interest in our work!

  1. The EPIC-Kitchens dataset splits each video into individual image frames; the input to the network is also a stack of multiple frames rather than a single image. The HAC dataset provides the original videos.

  2. We use the labels provided by EPIC-Kitchens (under the MM-SADA_Domain_Adaptation_Splits folder): each annotation has a start time and an end time that mark one action with its label.

  3. As in 2, we have the start and end time of each action, so we can extract the corresponding audio segment (see the sketch below).

  4. We report the average accuracy.
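
To make points 2-4 concrete, here is a minimal sketch (not the repository's actual loader) of how a start/end-time annotation can be mapped to the matching frame files and audio samples. The column names, frame-naming pattern, frame rate, and audio sample rate below are assumptions for illustration only.

```python
# Minimal sketch: turn a start/end-time annotation into frame paths and audio
# samples for one labeled action. All constants and column names are assumed.
import librosa

FPS = 60     # assumed video frame rate
SR = 24000   # assumed audio sample rate

def load_segment(ann_row, frame_dir, audio_path):
    """Return frame paths, audio samples, and label for one annotated action."""
    start_sec = ann_row["start_sec"]    # hypothetical column names
    stop_sec = ann_row["stop_sec"]
    label = ann_row["verb_class"]

    # Frames: a labeled action corresponds to a contiguous run of JPG frames.
    start_frame = int(start_sec * FPS)
    stop_frame = int(stop_sec * FPS)
    frames = [f"{frame_dir}/frame_{i:07d}.jpg"      # zero-padding assumed
              for i in range(start_frame, stop_frame + 1)]

    # Audio: slice the same time window out of the waveform.
    wav, _ = librosa.load(audio_path, sr=SR)
    audio = wav[int(start_sec * SR): int(stop_sec * SR)]

    return frames, audio, label
```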

lz19991122 commented 1 week ago

Thank you very much for your reply! Your work is excellent, and I am continuing to study it.

There is still one question about validation that confuses me though.

For example, suppose the segment 00:00-00:30 is labeled class 1 and 00:30-00:45 is labeled class 2.

1. When we classify, are we classifying the whole segment from 00:00-00:30?
    Input ---> 00:00-00:30 video, output ---> class1.
    During validation, we check the accuracy on 00:00-00:30 and 00:30-00:45.
    For example:
    True labels: 00:00-00:30 ---> class1; 00:30-00:45 ---> class2.
    Predicted labels: 00:00-00:30 ---> class2; 00:30-00:45 ---> class2.
    The accuracy is 0.5.

2. Or is classification a continuous process, where the model keeps classifying as video frames are fed in, e.g., one frame per second?
    Input ---> frame 1, output ---> class1; Input ---> frame 2, output ---> class1; Input ---> frame 45, output ---> class2; ...
    During validation, we check the accuracy of each frame.
    For example:
    True labels: frame 1 ---> class1; ...; frame 30 ---> class1; frame 31 ---> class2; frame 45 ---> class2.
    Predicted labels: frame 1 ---> class2; frame 2 ---> class2; ...; frame 44 ---> class1; frame 45 ---> class2.
    The accuracy is 1/45.

The second seems more practical to me. Does the evaluation in your paper follow the second setup?

Thanks so much for your patience.

donghao51 commented 1 week ago

Hello, we use the first setup.

Our task is video classification, so it is better to feed multiple frames rather than just one frame.
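
A minimal sketch of this first setup (segment-level evaluation), not the repository's exact code: each labeled action segment yields one prediction, and accuracy is averaged over segments. The `model` and the `segments` iterable are placeholders here.

```python
import torch

@torch.no_grad()
def evaluate(model, segments):
    """segments: iterable of (clip, label); clip has shape (T, C, H, W)."""
    correct, total = 0, 0
    for clip, label in segments:
        logits = model(clip.unsqueeze(0))   # one forward pass per action segment
        pred = logits.argmax(dim=1).item()  # a single predicted class per segment
        correct += int(pred == label)
        total += 1
    return correct / max(total, 1)          # average accuracy over segments
```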

lz19991122 commented 1 week ago

Thank you very much! I'll be following up on your work. Your work has been very inspiring and valuable to me.