chihhuiho / RGQA

Official code for Toward Unsupervised Realistic Visual Question Answering

How to evaluate BLIP-like models? #1

Closed ys-zong closed 1 year ago

ys-zong commented 1 year ago

Hi, thanks for the nice work and code! BLIP outputs natural-language text for VQA tasks with its decoder, unlike UNITER/LXMERT/etc., which have encoder-only architectures. So I'm wondering how you evaluated the performance of BLIP in the paper. Specifically, after generating answers to the questions, how did you determine whether a response refuted a false premise or still answered the incorrect question? Many thanks!

chihhuiho commented 1 year ago

How to evaluate answers generated by the decoder for unanswerable questions is a good and open question. In this project we did not address it, and I think it is a missing part of this work. Instead, we use BLIP's decoder to rank the GQA candidate answers (the rank-1 answer is selected as the prediction). The pretrained checkpoint from https://github.com/salesforce/LAVIS/tree/main#visual-question-answering-vqa is used. More specifically, given a list of predefined answers, we use this function (https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip_models/blip_vqa.py#L228-L236) from BLIP to rank the answers. The probability of each answer is then used for rejection to determine unanswerable questions. Thanks for bringing this up, and I look forward to your solution for evaluating generated answers on unanswerable questions.
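
For reference, a minimal sketch of the ranking step using LAVIS's public API is below. This is not the authors' exact evaluation code: it assumes the `blip_vqa` model with the `vqav2` checkpoint, a toy candidate answer list, and a placeholder image path.

```python
# Minimal sketch (assumptions noted above): rank a predefined answer list
# with BLIP's decoder via LAVIS and take the rank-1 answer as the prediction.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pretrained BLIP VQA checkpoint from the LAVIS model zoo.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_vqa", model_type="vqav2", is_eval=True, device=device
)

# Placeholder image and question for illustration.
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"]("What color is the cat on the sofa?")

# Hypothetical list of predefined candidate answers (in the paper this
# would be the GQA answer vocabulary).
answer_list = ["yes", "no", "black", "white", "dog", "cat"]

# inference_method="rank" scores every candidate with the decoder and
# returns the highest-ranked answer for each sample.
answers = model.predict_answers(
    samples={"image": image, "text_input": question},
    inference_method="rank",
    answer_list=answer_list,
    num_ans_candidates=len(answer_list),
)
print(answers[0])
```

Note that `predict_answers` only returns the selected answer strings; the per-answer scores used for rejection are computed inside the linked ranking function but are not exposed by the stock API, so thresholding on them would require returning those scores from that function.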