GeWu-Lab / TSPM

Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024.

Specific Settings of the ToMe Model #3

Open leeyf99 opened 1 month ago

leeyf99 commented 1 month ago

Could you please clarify which pre-trained ToMe model is used to extract the "visual_patch" features, and what value is used for ToMe's "r" parameter? Additionally, I noticed that the "audio_patch" feature is not actually used in the code. Thanks.

xia-zhe commented 2 weeks ago

I trained the model using the parameter settings specified in the code, and the results are as follows:

Audio Count Acc: 77.48 %
Audio Compt Acc: 60.44 %
Audio Averg Acc: 71.20 %

Visual Count Acc: 76.69 %
Visual Local Acc: 77.06 %
Visual Averg Acc: 76.88 %

Audio-Visual Exist Acc: 76.92 %
Audio-Visual Count Acc: 76.36 %
Audio-Visual Local Acc: 59.89 %
Audio-Visual Compt Acc: 63.67 %
Audio-Visual Templ Acc: 66.55 %
Audio-Visual Averg Acc: 69.17 %

---->Overall Accuracy: 71.57 %

Could you clarify where the issue occurred? Is it related to the "audio_patch" feature?