LinWeizheDragon / Retrieval-Augmented-Visual-Question-Answering

This is the official repository for Retrieval Augmented Visual Question Answering

Question about infoseek wikipedia chunking and results split #36

Closed Maxlinn closed 8 months ago

Maxlinn commented 8 months ago

Hi Lin, your series of works (RA-VQA, FLMR, PreFLMR) has made great contributions to the field of KB-VQA, which is really impressive!

Recently I have taken a special interest in the InfoSeek task. After reading the papers with due care, I still have two questions about the details, and I wonder if you could kindly help:

  1. What text preprocessing did you use for the Wikipedia articles accompanying the InfoSeek task, especially for chunking? In the released InfoSeek knowledge base, the Wikipedia articles are rather long. Did you do any chunking to divide a Wikipedia article into multiple candidate passages? And what prompt do you use for generating an answer based on the retrieved passage?
  2. In PreFLMR (https://arxiv.org/abs/2402.08327) Table 7, which split of InfoSeek does the result belong to? The test/human split of InfoSeek does not seem to have been released; is the result in the paper on the val split, or on the M2KR-subsampled InfoSeek split?

Thanks in advance!

LinWeizheDragon commented 8 months ago

Hi, thanks for your interest in our work. Re your questions:

  1. We simply formatted the passage text as `"title:", example['title'], "content:", example['text']` and did not do any chunking. We left the passages long since we also wanted to facilitate possible future extensions to longer passages. You can of course chunk the passages with any technique you like; just make sure the comparison is fair (e.g. comparing models on the same chunked passage pool).
  2. For answer generation (for InfoSeek), we use the same BLIP 2 model as in FLMR with the prompt `Question: ... Caption: ... Objects: ... Knowledge: ... Answer:` (see the sketch after this list). Specifically, the caption is extracted using BLIP 2 image captioning and the objects are extracted using VinVL.
  3. Yes, we resplit the validation set of InfoSeek and created a val set and a test set for the M2KR benchmark. The reported result is for the M2KR test split (which is subsampled from the original validation set).
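
For concreteness, here is a minimal sketch (not the repository's actual code) of how the passage text and the answer-generation prompt described above could be assembled. The record fields, the sample question, and the exact separators are illustrative assumptions:

```python
# Minimal illustrative sketch -- not the repo's actual implementation.
# `example` stands in for one InfoSeek knowledge-base record; the question,
# caption, and object tags are made-up placeholders for the outputs of
# BLIP 2 captioning and VinVL detection described above.
example = {
    "title": "River Thames",
    "text": "The River Thames is a river that flows through southern England ...",
}

# 1) Passage text used for retrieval: title + full content, no chunking.
passage_text = f"title: {example['title']} content: {example['text']}"

# 2) Answer-generation prompt for BLIP 2. The retrieved passage fills the
#    Knowledge field and is placed after the question/caption/objects, so if
#    the input runs over the token budget it is the knowledge that suffers.
question = "Which river is this building located next to?"
caption = "a large building next to a river"   # from BLIP 2 image captioning
objects = "building, river, bridge"            # from VinVL object detection

prompt = (
    f"Question: {question} "
    f"Caption: {caption} "
    f"Objects: {objects} "
    f"Knowledge: {passage_text} "
    f"Answer:"
)
print(prompt)
```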
Maxlinn commented 8 months ago

Much appreciation for your timely and detailed response; I have a much better understanding now!

About bullet 2, I have one more small question. My best guess is that `Knowledge:` is followed by the retrieved passage. But the Wikipedia articles in the InfoSeek knowledge base are thousands of words long, which seems impossible to fit within BLIP 2's context length without chunking. Could you please shed more light on this?

Thanks for your patience!

LinWeizheDragon commented 8 months ago

Hi, your concern is correct. We did not truncate the passages when generating the answer, so the input could indeed run over the allowed number of tokens. This is also why we put the knowledge after other useful information such as the caption and objects, so that those fields are not the ones that get truncated.

That said, our purpose is to show that models can easily gain performance from augmented knowledge. If you want to delve into the VQA ability itself, we highly recommend more careful preprocessing, which should yield better final performance. An easy approach would be splitting the passages into chunks, as you mentioned in your previous post, or replacing BLIP 2 with more advanced LMMs that allow more input tokens. PreFLMR can be integrated with any answer generator.
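
For reference, a minimal word-level chunking sketch along the lines suggested above (illustrative only, not the repository's implementation; the window and stride sizes are arbitrary assumptions):

```python
# Illustrative chunking sketch -- not the repo's implementation.
# Splits a long Wikipedia article into overlapping word windows so that each
# chunk comfortably fits an answer generator's context window.
def chunk_passage(title: str, text: str, window: int = 100, stride: int = 80):
    """Return 'title: ... content: ...' chunks of roughly `window` words."""
    words = text.split()
    chunks = []
    start = 0
    while True:
        piece = " ".join(words[start:start + window])
        chunks.append(f"title: {title} content: {piece}")
        if start + window >= len(words):
            break
        start += stride
    return chunks

# Each chunk can then be indexed and retrieved in place of the full article.
for chunk in chunk_passage("River Thames", "The River Thames is a river ..."):
    print(chunk)
```

With overlapping windows, an answer that straddles a chunk boundary still appears intact in at least one chunk; the fair-comparison caveat above still applies (all compared models should retrieve from the same chunked pool).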

Maxlinn commented 8 months ago

Thanks again for the timely response! All my questions have been addressed!