Open leeyf99 opened 1 month ago
I trained the model using the parameter settings specified in the code, and the results are as follows:

Audio Count Acc: 77.48 %
Audio Compt Acc: 60.44 %
Audio Averg Acc: 71.20 %

Visual Count Acc: 76.69 %
Visual Local Acc: 77.06 %
Visual Averg Acc: 76.88 %

Audio-Visual Exist Acc: 76.92 %
Audio-Visual Count Acc: 76.36 %
Audio-Visual Local Acc: 59.89 %
Audio-Visual Compt Acc: 63.67 %
Audio-Visual Templ Acc: 66.55 %
Audio-Visual Averg Acc: 69.17 %

---->Overall Accuracy: 71.57 %
Could you help me identify where the problem lies? Is it related to the "audio_patch" feature?
Could you also clarify which pre-trained ToMe model was used to obtain the "visual_patch" features, and what value of "r" was set for ToMe? Additionally, I noticed that the "audio_patch" feature is not actually used. Thanks.