Thank you for your excellent work. I have some confusion regarding the evaluation as follows:
In the paper, it is mentioned that the pre-trained CLIP can be applied to downstream VQA tasks. For closed-ended answers, which can be understood as a binary classification over "yes" and "no", I can predefine the two texts "yes" and "no" and compute the similarity between the given image and each text to get the prediction and the accuracy (I hope this understanding is correct). However, for open-ended answers, the paper mentions a fusion module and candidate answers, and I'm unclear about where the candidate answers come from and what role the fusion module plays in this context.
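To make my understanding of the closed-ended case concrete, here is a minimal sketch assuming an open_clip-style `encode_image` / `encode_text` interface (the actual PMC-CLIP API may differ; the names here are only illustrative):

```python
import torch

def closed_ended_predict(model, tokenizer, image_tensor):
    """Zero-shot yes/no prediction via image-text similarity (illustrative only)."""
    candidates = ["yes", "no"]
    with torch.no_grad():
        img_feat = model.encode_image(image_tensor.unsqueeze(0))   # (1, D)
        txt_feat = model.encode_text(tokenizer(candidates))        # (2, D)
        # Normalize and compare via cosine similarity.
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = img_feat @ txt_feat.T                               # (1, 2)
    return candidates[sims.argmax(dim=-1).item()]
```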
The fusion module is implemented in the released code; see pmc-clip.py.
For open-ended VQA, to align with SOTA, we adopt the common practice in the community of using all the potential answers provided in the ground truth as candidate answers, so the task is transformed into closed-ended VQA.
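As a rough sketch of that reduction: the candidate pool is simply the union of all ground-truth answers in the dataset, and each candidate is scored jointly with the image and question. Here `score_answer` is a hypothetical stand-in for the fusion-module score, not the exact PMC-CLIP interface:

```python
import torch

def rank_candidates(score_answer, image, question, gt_answers):
    """Open-ended VQA reduced to closed-ended ranking over candidate answers.

    `gt_answers`: all answers appearing in the dataset's ground truth,
    used as the candidate pool (the common practice described above).
    `score_answer(image, question, answer) -> float` stands in for the
    image-question-answer score produced by the fusion module.
    """
    candidates = sorted(set(gt_answers))
    scores = torch.tensor([score_answer(image, question, a) for a in candidates])
    return candidates[scores.argmax().item()]

# Accuracy is then computed by checking whether the top-ranked candidate
# matches the ground-truth answer for each question.
```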
That said, we don't agree with the practice of addressing open-ended VQA as a closed-ended task, so we proposed another work that autoregressively generates answers, as in PMC-VQA.