MILVLG / mcan-vqa

Deep Modular Co-Attention Networks for Visual Question Answering
Apache License 2.0

linear fusion model #30

Closed clytze0216 closed 3 years ago

clytze0216 commented 3 years ago

Thank you for sharing. Have you tried replacing the linear multimodal fusion model with another fusion method, and if so, did it affect the accuracy? Looking forward to your reply. Thanks a lot!

MIL-VLG commented 3 years ago

We have tested several other fusion models, including element-wise product (eltwise-prod), concatenation, and bilinear pooling. However, none of them brought any improvement. We think the reason could be that multimodal fusion is already performed implicitly in the deep co-attention learning stage.

Since the modification is minor, you can try it yourself. If you find any new results, we would appreciate it if you could let us know.
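
For anyone who wants to try this, here is a rough, framework-free sketch of the four fusion variants mentioned above, applied to one attended question vector and one attended image vector. It is not code from this repository: the function names (`linear`, `fuse_sum`, `fuse_prod`, `fuse_concat`, `fuse_bilinear`) are made up for illustration, the "linear" variant is assumed to be a sum of linearly projected features, and any normalization applied after fusion is left out.

```python
def linear(x, weight):
    """Project vector x with a weight matrix given as a list of rows (hypothetical helper)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse_sum(q, v):
    """Linear (additive) fusion: element-wise sum of two same-sized vectors."""
    return [a + b for a, b in zip(q, v)]


def fuse_prod(q, v):
    """Element-wise product fusion."""
    return [a * b for a, b in zip(q, v)]


def fuse_concat(q, v):
    """Concatenation fusion: output size is len(q) + len(v)."""
    return list(q) + list(v)


def fuse_bilinear(q, v, tensor):
    """Full bilinear pooling: one output per matrix W in tensor, z_k = q^T W_k v."""
    return [
        sum(q[i] * W[i][j] * v[j] for i in range(len(q)) for j in range(len(v)))
        for W in tensor
    ]


if __name__ == "__main__":
    # Toy 2-d features standing in for the attended question / image vectors.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    q = linear([1.0, 2.0], identity)
    v = linear([3.0, 4.0], identity)

    print("sum     :", fuse_sum(q, v))
    print("prod    :", fuse_prod(q, v))
    print("concat  :", fuse_concat(q, v))
    print("bilinear:", fuse_bilinear(q, v, [identity]))
```

In a real experiment each variant would replace the fusion step in the model and feed the classifier. Note that concatenation changes the input size of the following layer, and full bilinear pooling needs one weight matrix per output dimension, so it grows quickly with the feature size.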