LinWeizheDragon / Retrieval-Augmented-Visual-Question-Answering

This is the official repository for Retrieval Augmented Visual Question Answering
GNU General Public License v3.0

Didn't find dpr_training_annotations #6

Closed yao-jz closed 1 year ago

yao-jz commented 1 year ago

Hi, when I try to train the DPR model, I can't find the dpr_training_annotations files in the repo.

The config in the jsonnet file is:

local dpr_training_annotations = {
  "train": "../data/ok-vqa/pre-extracted_features/passages/retriever_train.json",
  "valid": "../data/ok-vqa/pre-extracted_features/passages/retriever_testdev.json",
  "test": "../data/ok-vqa/pre-extracted_features/passages/retriever_test.json",
};

I didn't find the three json files.
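
A quick way to confirm which of the configured annotation files are actually absent is sketched below. The paths are copied from the config quoted above; the helper name `find_missing` and the `base_dir` parameter are hypothetical, and the relative paths are assumed to resolve against the directory the training script is started from.

```python
from pathlib import Path

# Paths copied from the jsonnet config quoted above.
dpr_training_annotations = {
    "train": "../data/ok-vqa/pre-extracted_features/passages/retriever_train.json",
    "valid": "../data/ok-vqa/pre-extracted_features/passages/retriever_testdev.json",
    "test": "../data/ok-vqa/pre-extracted_features/passages/retriever_test.json",
}


def find_missing(paths, base_dir="."):
    """Return the split names whose file does not exist under base_dir.

    Hypothetical helper: base_dir is assumed to be the directory the
    relative paths in the config are resolved against.
    """
    base = Path(base_dir)
    return [split for split, rel in paths.items() if not (base / rel).is_file()]


if __name__ == "__main__":
    for split in find_missing(dpr_training_annotations):
        print(f"missing {split}: {dpr_training_annotations[split]}")
```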

Thanks!

LinWeizheDragon commented 1 year ago

Could you please double-check whether the files are here? I am away from my workstation at the moment, but I think I successfully ran all training with the following:

Packed pre-extracted data for both OK-VQA and F-VQA (including OCR features, VinVL object detection features, Oscar captioning features): Google Drive

yao-jz commented 1 year ago

I didn't use the pre-extracted data you provided.

I have generated the OCR features, VinVL object detection features, and the captions with your code.

LinWeizheDragon commented 1 year ago

I see. The missing files are annotations from the GS author. Could you please download the packed file and extract the missing files from it? I packed everything into this file at release.

yao-jz commented 1 year ago

I tried several times on different networks but still failed to download the file. I think it is too large to download from Google Drive.

I downloaded the retriever_train/testdev/test.json files here.

But at the same time, I also found that another file is missing: okvqa_full_corpus_title.csv.

LinWeizheDragon commented 1 year ago

Will Baiduyun work for you? If so, I can upload a copy there tomorrow.

okvqa_full_corpus_title.csv adds a dummy "title" column to okvqa_full_corpus.csv so that it can be processed by the script that generates the index file. It is also in the packed file.
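
For anyone who cannot download the packed file, a minimal sketch of producing such a file from okvqa_full_corpus.csv follows. It is not the script used for the release: the function name `add_dummy_title`, the empty placeholder value, and appending "title" as the last column are all assumptions, so the column layout should be checked against whatever the index-generation script expects.

```python
import csv


def add_dummy_title(src_path, dst_path, title=""):
    """Copy a CSV and append a constant "title" column to every row.

    Assumptions (not confirmed by the repository): the column is named
    "title", it goes last, and an empty string is an acceptable dummy value.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + ["title"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            row["title"] = title
            writer.writerow(row)


# Example call (file names taken from this thread):
# add_dummy_title("okvqa_full_corpus.csv", "okvqa_full_corpus_title.csv")
```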

yao-jz commented 1 year ago

I have already downloaded the missing retriever_* files from another repo. Thank you very much.

There is a "kid" column in okvqa_full_corpus.csv (as shown below). Is that the dummy column?

kid,text
0,text
1,text
...

LinWeizheDragon commented 1 year ago

Here are the first two rows of that file:

kid,text
passage,"about the doberman pinscher dobermans are compactly-built dogs—muscular, fast, and powerful—standing between 24 to 28 inches at the shoulder.  dobermans are compactly-built dogs—muscular, fast, and powerful—standing between 24 to 28 inches at the shoulder."
passage,history: a german named louis dobermann is credited with developing the doberman pinscher breed in the late 1800s. he was a tax collector and wanted a fierce guard dog to accompany him on his rounds.

yao-jz commented 1 year ago

But these are the first two rows of the okvqa_full_corpus.csv I downloaded:

kid,text
0,"about the doberman pinscher dobermans are compactly-built dogs—muscular, fast, and powerful—standing between 24 to 28 inches at the shoulder.  dobermans are compactly-built dogs—muscular, fast, and powerful—standing between 24 to 28 inches at the shoulder."
1,history: a german named louis dobermann is credited with developing the doberman pinscher breed in the late 1800s. he was a tax collector and wanted a fierce guard dog to accompany him on his rounds.

LinWeizheDragon commented 1 year ago

Link: https://pan.baidu.com/s/17CJ_yWdsDX3Agz4nTnYX4g
Password: xj3e

I think I refactored the kid column at some point, for a reason I no longer remember. You can keep the values in your file as they are and modify the processing script that generates the FAISS index. You shouldn't have any difficulty running the DPR training, since it doesn't require the _title.csv file; that file is only used when creating the FAISS index.

For your convenience, I shared the passage files above.
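
Since the two copies of the corpus differ only in what the kid column holds (integers in one, the literal string "passage" in the other), one way to adapt a processing script is to ignore kid and identify passages by row position. The loader below is only a sketch under that assumption; `load_passages`, its parameters, and the output keys are hypothetical and are not taken from the repository's index-generation script.

```python
import csv


def load_passages(csv_path, text_column="text", dummy_title=""):
    """Read a passage CSV into a list of {"id", "title", "text"} dicts.

    The "kid" column is deliberately ignored: ids are assigned from the
    row position, so both variants of the file yield the same result.
    All names here are illustrative assumptions.
    """
    passages = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for position, row in enumerate(csv.DictReader(f)):
            passages.append(
                {"id": position, "title": dummy_title, "text": row[text_column]}
            )
    return passages
```

The resulting list can then be passed to whatever step builds the index, provided that step is adjusted to read the `id`, `title`, and `text` keys.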

yao-jz commented 1 year ago

Thanks!