chihhuiho / RGQA

Official code for Toward Unsupervised Realistic Visual Question Answering

How to evaluate BLIP-like models? #1

Closed ys-zong closed 1 year ago

ys-zong commented 1 year ago

Hi, thanks for the nice work and code! BLIP outputs natural-language text for VQA tasks with its decoder, unlike UNITER/LXMERT/etc., which have encoder-only architectures. So I'm wondering how you evaluated the performance of BLIP in the paper. Specifically, after generating answers to the questions, how did you determine whether a response refuted a false premise or still answered the incorrect question? Many thanks!

chihhuiho commented 1 year ago

How to evaluate answers generated by the decoder for unanswerable questions is a good and open question. In this project we did not address it, and I think it is a missing part of this work. Instead, we use BLIP's decoder to rank the GQA candidate answers (the rank-1 answer is selected as the prediction). The pretrained checkpoint from https://github.com/salesforce/LAVIS/tree/main#visual-question-answering-vqa is used. More specifically, given a list of predefined answers, we use this function (https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip_models/blip_vqa.py#L228-L236) from BLIP to rank the answers. The probability of each answer is then used for rejection to determine unanswerable questions. Thanks for bringing this up, and I look forward to your solution for evaluating generated answers on unanswerable questions.
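
For reference, a minimal sketch of the ranking step using LAVIS's public API is below. This is not the authors' exact evaluation code: it assumes the `blip_vqa` model with the `vqav2` checkpoint, a toy candidate answer list, and a placeholder image path.

```python
# Minimal sketch (assumptions noted above): rank a predefined answer list
# with BLIP's decoder via LAVIS and take the rank-1 answer as the prediction.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pretrained BLIP VQA checkpoint from the LAVIS model zoo.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_vqa", model_type="vqav2", is_eval=True, device=device
)

# Placeholder image and question for illustration.
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"]("What color is the cat on the sofa?")

# Hypothetical list of predefined candidate answers (in the paper this
# would be the GQA answer vocabulary).
answer_list = ["yes", "no", "black", "white", "dog", "cat"]

# inference_method="rank" scores every candidate with the decoder and
# returns the highest-ranked answer for each sample.
answers = model.predict_answers(
    samples={"image": image, "text_input": question},
    inference_method="rank",
    answer_list=answer_list,
    num_ans_candidates=len(answer_list),
)
print(answers[0])
```

Note that `predict_answers` only returns the selected answer strings; the per-answer scores used for rejection are computed inside the linked ranking function but are not exposed by the stock API, so thresholding on them would require returning those scores from that function.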