clytze0216 closed this issue 3 years ago
We have tested some other fusion models, such as element-wise product, concatenation, and bilinear pooling. However, none of them brought any improvement. We think the reason could be that multimodal fusion has already been performed implicitly in the deep co-attention learning stage.
Since the modification is minor, you can try it yourself. If you find any new results, we would appreciate it if you could let us know.
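For anyone who wants to try the swap, here is a rough NumPy sketch of the fusion variants mentioned above. It is not the repository's code: the function name `fuse`, the weight names `w_x` and `w_y`, and the feature sizes are all made up for illustration, and it assumes the linear fusion is simply the sum of two linear projections of the attended features. Bilinear pooling is left out because it needs considerably more machinery.

```python
import numpy as np


def fuse(x, y, w_x, w_y, mode="sum"):
    """Fuse two attended feature vectors (hypothetical helper, not the repo's API).

    x: (batch, d_x) features from one modality
    y: (batch, d_y) features from the other modality
    w_x: (d_x, d_z) and w_y: (d_y, d_z) projection matrices
    """
    # Project both modalities into a common d_z-dimensional space.
    px, py = x @ w_x, y @ w_y
    if mode == "sum":
        # Assumed form of the linear fusion: add the two projections.
        return px + py
    if mode == "prod":
        # Element-wise product of the two projections.
        return px * py
    if mode == "concat":
        # Concatenation along the feature axis; output is 2 * d_z wide.
        return np.concatenate([px, py], axis=-1)
    raise ValueError(f"unknown fusion mode: {mode}")


# Toy inputs with arbitrary sizes, only to check the output shapes.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
y = rng.standard_normal((2, 8))
w_x = rng.standard_normal((8, 16))
w_y = rng.standard_normal((8, 16))

for mode in ("sum", "prod", "concat"):
    print(mode, fuse(x, y, w_x, w_y, mode).shape)
```

Note that the concatenation variant doubles the feature width, so whatever layer consumes the fused vector would need its input size changed accordingly.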
Thank you for sharing. I would like to ask whether you have tried changing the linear multimodal fusion model, and if so, whether it affects the accuracy. Looking forward to your reply. Thanks a lot!