How to evaluate the generated answers from the decoder on unanswerable questions is a good and open question. In this project we did not address this issue, and I think it is a missing part of the project. Instead, we use BLIP's decoder to rank the GQA candidate answers (the rank-1 answer is selected as the prediction). The pretrained checkpoint from https://github.com/salesforce/LAVIS/tree/main#visual-question-answering-vqa is used. More specifically, given a list of predefined answers, we use this function from BLIP to rank them: https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip_models/blip_vqa.py#L228-L236. The probability of each answer is then used for rejection, i.e., to determine unanswerable questions. Thanks for bringing this up, and I look forward to your solution for evaluating generated answers on unanswerable questions.
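For anyone landing here later, below is a minimal sketch of the pipeline described above, not the authors' exact evaluation script: load the pretrained BLIP VQA checkpoint from LAVIS, rank a predefined list of candidate answers with the decoder, and reject low-confidence predictions as unanswerable. The `model_type="vqav2"` choice, the placeholder image/question/answer list, and the way the per-answer probabilities are exposed for thresholding are all assumptions on my part.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained BLIP VQA checkpoint, as linked in the LAVIS README.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_vqa", model_type="vqav2", is_eval=True, device=device
)

# Placeholder inputs for illustration.
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"]("What color is the cat?")

# Predefined candidate answers (e.g. the GQA answer vocabulary).
answer_list = ["black", "white", "brown", "gray"]

# inference_method="rank" scores each candidate with the decoder
# (via the rank_answers function linked above) and returns the
# top-ranked answer string.
pred = model.predict_answers(
    samples={"image": image, "text_input": question},
    answer_list=answer_list,
    inference_method="rank",
    num_ans_candidates=len(answer_list),
)[0]

# predict_answers only returns the answer string; to apply the rejection
# step one would expose the per-answer probabilities computed inside
# rank_answers and threshold the top score, e.g. (hypothetical
# `answer_probs` tensor and tuned REJECT_THRESHOLD):
# if answer_probs.max() < REJECT_THRESHOLD:
#     pred = "unanswerable"
print(pred)
```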
Hi, thanks for the nice work and code! BLIP outputs natural text for VQA tasks with its decoder, unlike UNITER/LXMERT/etc., which have encoder-only architectures. So I'm wondering how you evaluated the performance of BLIP in the paper. Specifically, after generating answers to the questions, how did you determine whether a response refuted a false premise or still answered the incorrect question? Many thanks!