Thank you for your excellent work. I have some confusion regarding the evaluation as follows:
In the paper, it is mentioned that the pre-trained CLIP can be applied to downstream VQA tasks. For closed-ended answers, which can be understood as a binary classification over "yes" and "no", I can predefine the two texts "yes" and "no" and compute the similarity between the given image and each text to get the prediction and the accuracy (I hope this understanding is correct). However, for open-ended answers, the paper mentions a fusion module and candidate answers, and I'm unclear about where the candidate answers come from and what role the fusion module plays in this context.
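To make my understanding of the closed-ended case concrete, here is a minimal sketch assuming an open_clip-style `encode_image` / `encode_text` interface (the actual PMC-CLIP API may differ; the names here are only illustrative):

```python
import torch

def closed_ended_predict(model, tokenizer, image_tensor):
    """Zero-shot yes/no prediction via image-text similarity (illustrative only)."""
    candidates = ["yes", "no"]
    with torch.no_grad():
        img_feat = model.encode_image(image_tensor.unsqueeze(0))   # (1, D)
        txt_feat = model.encode_text(tokenizer(candidates))        # (2, D)
        # Normalize and compare via cosine similarity.
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = img_feat @ txt_feat.T                               # (1, 2)
    return candidates[sims.argmax(dim=-1).item()]
```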
The fusion module is implemented in the released code; see pmc-clip.py.
For open-ended VQA, to align with SOTA, we adopt the common practice in the community of using all the potential answers provided in the ground truth as candidate answers, so the task is transformed into closed-ended VQA.
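As a rough sketch of that reduction: the candidate pool is simply the union of all ground-truth answers in the dataset, and each candidate is scored jointly with the image and question. Here `score_answer` is a hypothetical stand-in for the fusion-module score, not the exact PMC-CLIP interface:

```python
import torch

def rank_candidates(score_answer, image, question, gt_answers):
    """Open-ended VQA reduced to closed-ended ranking over candidate answers.

    `gt_answers`: all answers appearing in the dataset's ground truth,
    used as the candidate pool (the common practice described above).
    `score_answer(image, question, answer) -> float` stands in for the
    image-question-answer score produced by the fusion module.
    """
    candidates = sorted(set(gt_answers))
    scores = torch.tensor([score_answer(image, question, a) for a in candidates])
    return candidates[scores.argmax().item()]

# Accuracy is then computed by checking whether the top-ranked candidate
# matches the ground-truth answer for each question.
```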
That said, we don't agree with the practice of addressing open-ended VQA as a closed-ended task, so we proposed another work that autoregressively generates answers, as in PMC-VQA.