GeWu-Lab / OGM-GE_CVPR2022

The repo for "Balanced Multimodal Learning via On-the-fly Gradient Modulation", CVPR 2022 (ORAL)

Ablation results on CREMA-D #6

Open youcaiSUN opened 2 years ago

youcaiSUN commented 2 years ago

Dear authors,

I ran ablation experiments for the proposed OGM-GE method on CREMA-D. The results show that the improvement brought by OGM-GE is not as pronounced as in Table 1 of the paper, and it is sometimes even worse than the baseline without OGM-GE, which surprised me, so I hope you can clarify. The results with the visual branch using 1 image (consistent with the paper) are as follows:

| Method  | Alpha | BS | Acc   | Best Epoch |
|---------|-------|----|-------|------------|
| Normal* | -     | 64 | 51.7  | -          |
| OGM-GE* | -     | 64 | 61.9  | -          |
| Normal  | -     | 16 | 60.22 | 32         |
| Normal  | -     | 64 | 54.70 | 27         |
| OGM-GE  | 0.0   | 16 | 58.47 | 42         |
| OGM-GE  | 0.0   | 64 | 54.03 | 53         |
| OGM-GE  | 0.1   | 16 | 59.00 | 36         |
| OGM-GE  | 0.1   | 64 | 54.57 | 77         |
| OGM-GE  | 0.2   | 16 | 58.74 | 52         |
| OGM-GE  | 0.2   | 64 | 55.38 | 42         |
| OGM-GE  | 0.3   | 16 | 58.47 | 48         |
| OGM-GE  | 0.3   | 64 | 52.55 | 36         |
| OGM-GE  | 0.4   | 16 | 59.54 | 52         |
| OGM-GE  | 0.4   | 64 | 51.88 | 11         |
| OGM-GE  | 0.5   | 16 | 58.74 | 43         |
| OGM-GE  | 0.5   | 64 | 54.84 | 29         |
| OGM-GE  | 0.6   | 16 | 59.27 | 68         |
| OGM-GE  | 0.6   | 64 | 54.17 | 59         |
| OGM-GE  | 0.7   | 16 | 55.18 | 43         |
| OGM-GE  | 0.7   | 64 | 52.96 | 11         |
| OGM-GE  | 0.8   | 16 | 56.05 | 49         |
| OGM-GE  | 0.8   | 64 | 54.57 | 87         |
| OGM-GE  | 0.9   | 16 | 58.33 | 98         |
| OGM-GE  | 0.9   | 64 | 54.44 | 17         |
The results with the visual branch using 3 images (additional experiments) are as follows:

| Method | Alpha | BS | Acc   | Best Epoch |
|--------|-------|----|-------|------------|
| Normal | -     | 16 | 66.67 | 40         |
| OGM-GE | 0.0   | 16 | 67.34 | 88         |
| OGM-GE | 0.1   | 16 | 68.15 | 53         |
| OGM-GE | 0.2   | 16 | 65.46 | 60         |
| OGM-GE | 0.3   | 16 | 65.86 | 55         |
| OGM-GE | 0.5   | 16 | 63.71 | 9          |
| OGM-GE | 0.7   | 16 | 65.46 | 66         |
| OGM-GE | 0.9   | 16 | 64.25 | 47         |

Notes:

  1. "*" marks results reported in Tab. 1 of the paper (using concat for fusion); "-" means unknown or unimportant.
  2. Video frames were extracted with the obtain_frames.py script in pre-processing. Since videos in CREMA-D are generally shorter than 4 seconds, I adjusted its parameter configuration to extract 4 frames per video. The pick_num variable in the getitem method of dataset/dataset.py controls whether the visual encoder receives the code's default of 3 images or the 1 frame the paper specifies for this dataset.
  3. The audio spectrograms follow the values specified in the paper, i.e., 299 frames are extracted from each video's audio, with zero-padding when shorter and truncation when longer.
  4. Apart from modulation, alpha, and batch size, all other training hyperparameters are the defaults in main.py.
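The zero-padding/truncation in note 3 can be sketched as follows (an illustrative helper assuming a `(freq_bins, time_frames)` NumPy array; the function name is mine, not the repo's):

```python
# Minimal sketch: force the spectrogram's time axis to exactly 299 frames,
# zero-padding short clips and truncating long ones (illustrative only).
import numpy as np

def fix_length(spec: np.ndarray, target: int = 299) -> np.ndarray:
    """Pad with zeros or truncate along the time axis to `target` frames."""
    t = spec.shape[1]
    if t < target:                                   # too short: zero-pad
        return np.pad(spec, ((0, 0), (0, target - t)))
    return spec[:, :target]                          # too long: truncate

short = fix_length(np.ones((257, 120)))
long_ = fix_length(np.ones((257, 400)))
assert short.shape == long_.shape == (257, 299)
```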

Observations:

  1. On CREMA-D, using 3 images in the visual branch is clearly better than using 1.
  2. With 1 image, batch size 16 is clearly better than 64. The best result, 60.22, comes from the Normal method (the baseline without OGM-GE) at batch size 16 (still roughly 2 points below the paper's result), while the best accuracy with OGM-GE is only 59.54 (alpha=0.4). By contrast, Tab. 1 of the paper shows a drop of about 10 points without OGM-GE (61.9 --> 51.7), so my results differ substantially from the paper's. Different alpha values give somewhat different results, but smaller alpha generally works better; alpha=0, i.e., GE only without OGM, also achieves decent results.
  3. With 3 images, OGM-GE with alpha=0.1 works best, and the results are more sensitive to the choice of alpha than with 1 image. Notably, the Normal method and alpha=0 still perform comparably to the best alpha.

Since my experimental setup may not be exactly the same as yours, some discrepancy is expected, but I hope you can help clarify the gap. Thanks!
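For context on the alpha sensitivity discussed above, here is a rough sketch of how alpha enters the OGM coefficient, reconstructed from my reading of the paper (the exact form and names are my own; please verify against main.py):

```python
# Hedged sketch of the OGM coefficient: the dominant modality's gradients
# are scaled down more as alpha grows; alpha = 0 disables OGM (GE only).
import math

def modulation_coeffs(score_a: float, score_v: float, alpha: float):
    """Return (coeff_audio, coeff_visual) multiplying each branch's gradients."""
    ratio = score_a / score_v          # discrepancy ratio between modalities
    if ratio > 1:                      # audio dominates: damp audio gradients
        return 1 - math.tanh(alpha * ratio), 1.0
    return 1.0, 1 - math.tanh(alpha / ratio)   # visual dominates

# With alpha = 0 both coefficients stay 1.0, i.e., no gradient modulation.
assert modulation_coeffs(2.0, 1.0, 0.0) == (1.0, 1.0)
```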

zhengrongz commented 2 years ago

I have the same question; I also cannot reach the results reported in the paper.

dengandong commented 2 years ago

Hi @youcaiSUN @zhengrongz ,

Thank you for your interest in our work and for your careful experimental validation. It helps us a lot in improving the quality of this repository.

I apologize for my carelessness when reorganizing the OGM-GE code. Yesterday, after correcting some errors in main.py (most importantly, at line 79 the weight for 'out_v' should be 'weight[:, 512:]', not 'weight[:, :512]', and similarly for 'out_a') so that it is consistent with our experimental code, we obtained the correct results (above 0.64 with OGM-GE) on the CREMA-D dataset. The trained checkpoint has been updated; please refer to https://zenodo.org/record/6590986#.YpM4EVRBxPY. The corrected code has also been pushed, so you can validate again with the new code.

If you have further questions on this, please do not hesitate to contact us!

Thanks again!

Best, Andong

youcaiSUN commented 2 years ago

Hi Andong,

Thanks for your reply. Since the weights in the final FC layer seem exchangeable between the audio and video features, I wonder why the relative order matters? Nevertheless, I am running the code after your fix.

Best, Licai

dengandong commented 2 years ago

> Hi Andong,
>
> Thanks for your reply. Since the weights in the final FC layer seem exchangeable between the audio and video features, I wonder why the relative order matters? Nevertheless, I am running the code after your fix.
>
> Best, Licai

Actually, you can refer to ConcatFusion in model/fusion_modules.py. Before the output FC layer, x and y are concatenated in a specific order, i.e., when passing through the FC layer, each element of x or y is multiplied only with the element of the FC weight at the corresponding position. What we want is the separate output for x and for y, and this is why the order matters.
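The point above can be sketched as follows (assuming 512-dim features per modality as in the repo's encoders; this just demonstrates the weight slicing, it is not the repo's exact code):

```python
# Why the concat order matters: audio features multiply the FIRST 512
# columns of the FC weight, visual features the LAST 512 columns.
# Using weight[:, :512] for out_v (the bug) mixes up the two halves.
import torch
import torch.nn as nn

dim, n_classes = 512, 6
fc = nn.Linear(2 * dim, n_classes)           # acts on concat([a, v], dim=1)
a, v = torch.randn(4, dim), torch.randn(4, dim)

out = fc(torch.cat([a, v], dim=1))           # fused logits

# Separate per-modality outputs, splitting the bias evenly:
out_a = a @ fc.weight[:, :dim].t() + fc.bias / 2
out_v = v @ fc.weight[:, dim:].t() + fc.bias / 2

# The two partial outputs sum back exactly to the fused output.
assert torch.allclose(out_a + out_v, out, atol=1e-5)
```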

youcaiSUN commented 2 years ago

I got ya. Thanks!

zhengrongz commented 2 years ago

Got it! Thanks!

youcaiSUN commented 2 years ago

With the fixed code, I can reproduce the results in Tab. 1. Thanks again!

dengandong commented 2 years ago

> With the fixed code, I can reproduce the results in Tab. 1. Thanks again!

No problem! And thank you again for this issue you raised!

Nikonal commented 1 year ago

Could you share the data in videoflash? The data at the original link is corrupted and cannot be loaded.