Closed: LYuhang closed this issue 4 years ago
The code is available here: https://github.com/HLR/Cross_Modality_Relevance. Don't you have access?
I am very grateful that you replied so quickly. However, I think the code in this project is for the NLVR2 dataset, which takes two images and one string as input, while Visual Question Answering takes only one image and one string. Besides, the relevance matrix calculation differs slightly between the two tasks. As I understand them, there are four relevance matrices to calculate for NLVR2 (image1_rel-image2_rel, image1_rel-string_rel, image2_rel-string_rel, image-string), while there are only two for VQA (image-string and image_rel-string_rel). So do you mean that I only need to repeat the image twice to satisfy the model input, or do I need to change the code slightly myself? Thank you!
Thank you for your attention to our ACL work! We mention this in the last paragraph of Section 3.3 of our ACL paper. For the VQA dataset, the above setting results in one entity relevance representation: a textual-visual entity relevance. For the NLVR2 dataset, there are three entity relevance representations: two textual-visual entity relevances and a visual-visual entity relevance between the two images. Therefore, all you need to do is comment out the code for the second image's textual-visual entity relevance and for the visual-visual entity relevance. Since you remove two entity relevances, please don't forget to change the input dimension of self.final_classifier in cmr_nlvr2_model.py. By the way, our VQA result is shown on the leaderboard: https://evalai.cloudcv.org/web/challenges/challenge-page/163/leaderboard/498#leaderboardrank-8 Feel free to ask if you have further questions.
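Schematically, the classifier-head change looks like the sketch below. This is only an illustrative outline, not the exact repository code: aside from self.final_classifier and the file cmr_nlvr2_model.py mentioned above, the class name, dimensions, and helper structure are placeholders I chose for the example.

```python
import torch
import torch.nn as nn


class CMRClassifierHead(nn.Module):
    """Illustrative sketch (not the repository code) of the final classifier.

    For NLVR2, three entity relevance vectors are concatenated
    (text-image1, text-image2, image1-image2). For VQA only the single
    textual-visual relevance remains, so the classifier input shrinks
    from 3 * rel_dim to rel_dim.
    """

    def __init__(self, rel_dim: int, num_answers: int, nlvr2: bool = False):
        super().__init__()
        num_relevances = 3 if nlvr2 else 1  # VQA keeps one relevance vector
        # Corresponds to changing the self.final_classifier input dimension
        # in cmr_nlvr2_model.py after removing the two extra relevances.
        self.final_classifier = nn.Linear(num_relevances * rel_dim, num_answers)

    def forward(self, relevance_vectors):
        # For VQA, relevance_vectors holds only the textual-visual relevance;
        # the second image's textual-visual and the visual-visual relevances
        # are simply never computed (their code is commented out).
        joint = torch.cat(relevance_vectors, dim=-1)
        return self.final_classifier(joint)


# Example: batch of 2, a 768-d relevance vector, a 3129-way answer vocabulary
# (both sizes are placeholder values for the sketch).
head = CMRClassifierHead(rel_dim=768, num_answers=3129)
logits = head([torch.randn(2, 768)])
print(logits.shape)  # torch.Size([2, 3129])
```

In short: delete (or comment out) the two relevance computations that involve the second image, and make the linear layer's input dimension match the single remaining relevance vector.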
Thank you very much!
Hello, I am doing research on Visual Question Answering. Could you release the code for the VQA task? I would be very grateful.