Hi Rohan:
As mentioned in the MM OCR paper, we use accuracy, not ANLS, to evaluate STVQA and DocVQA.
The script we used for MM OCR is here.
I have uploaded the answer JSON files for the four datasets to the LLaVAR/MultimodalOCR folder.
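For reference, an accuracy check in this style looks roughly like the following. This is only a minimal sketch of a relaxed containment match; the exact normalization and matching rules in the official MM OCR script may differ, and the field names in the sample dicts are assumptions.

```python
def is_correct(prediction: str, gt_answers: list[str]) -> bool:
    """Relaxed accuracy: a sample counts as correct if any ground-truth
    answer appears as a substring of the prediction (case-insensitive)."""
    pred = prediction.strip().lower()
    return any(ans.strip().lower() in pred for ans in gt_answers)

def accuracy(samples: list[dict]) -> float:
    """samples: [{"prediction": str, "answers": [str, ...]}, ...]  (assumed schema)"""
    hits = sum(is_correct(s["prediction"], s["answers"]) for s in samples)
    return hits / max(len(samples), 1)
```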
Hi @StevenyzZhang, I tried using the code and method you shared, but there is a runtime issue with the new version of LLaVA. Specifically, the following line throws an error. I replaced it with the line below, which leads to another error further down, and that in turn is resolved by loading the model with the new method shown. I believe the changes in the original LLaVA repo may no longer be compatible with what you shared.
vision_tower = model.model.vision_tower[0] # original
replaced with
vision_tower = model.get_vision_tower() # no zero index
New method for loading the model, tokenizer, and image processor, per the LLaVA repo update:
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, model_base, model_name)
This function internally loads the vision tower (see run_llava.py in the LLaVA/llava/eval folder of the repo).
The issue with the shared code is that only the vision config gets initialized; the vision tower weights are never actually loaded. As a result, when the generate function is triggered and images need to be passed through the vision tower, it fails.
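For context, the loading path I ended up with against the current LLaVA repo looks roughly like this. Paths and the checkpoint name are placeholders, and the explicit is_loaded check just mirrors what the builder normally does internally, so treat this as a sketch rather than the exact setup used for the paper:

```python
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "/path/to/LLaVAR-13b-v1"   # placeholder checkpoint path
model_base = None                        # base LLaMA path if loading delta/LoRA weights
model_name = get_model_name_from_path(model_path)

# New-style loading: tokenizer, model, and image processor come from one builder call.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, model_base, model_name
)

# The vision tower is now accessed through a method instead of an indexed attribute.
vision_tower = model.get_vision_tower()   # replaces model.model.vision_tower[0]
if not vision_tower.is_loaded:            # make sure the CLIP weights are actually loaded
    vision_tower.load_model()
```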
Can you please reproduce your results with the latest LLaVA repo and the code you shared in your comment? Thank you so much!!
Hi Rohan:
The shared code should work well with the LLaVA code in this repo.
@zhangry868 @StevenyzZhang
Describe the issue
Can you please share a JSON file with per-sample results and the overall accuracy for DocVQA and STVQA, along with the inference parameters used for the scores reported in the paper?
I am attaching a screenshot of my reproduced results on DocVQA and STVQA. Two parameters that differ between the LLaVA repo and the MMOCR inference code have been varied:
Temperature: 0.2 or 0.9
QS template: question first or question last (this uses IMAGE_TOKEN rather than IMAGE_PATCH_TOKEN repeated seq_len times, but the two are equivalent; check the LLaVA code for reference). See the sketch after this list for the two variants.
The decoding length is fixed to an upper limit of 256 tokens, as used in MMOCR.
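A sketch of the two QS template variants I used. The token and template names come from llava.constants and llava.conversation; whether <im_start>/<im_end> wrapping is also needed depends on the checkpoint's mm_use_im_start_end setting, so this is an assumption rather than the exact prompt-building code:

```python
from llava.constants import DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

def build_prompt(question: str, question_first: bool = True) -> str:
    # Variant A: question first, image token appended.
    # Variant B: image token first, question appended (the LLaVA default ordering).
    qs = (question + "\n" + DEFAULT_IMAGE_TOKEN) if question_first \
         else (DEFAULT_IMAGE_TOKEN + "\n" + question)
    conv = conv_templates["llava_v1"].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    return conv.get_prompt()
```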
I am also attaching a zip file containing LLaVAR results on the two datasets with 4 different inference runs per dataset, i.e., 8 runs in total. Each run is a folder containing a result JSON (the ANLS score as a decimal) and "dataset_name".json with the response for each sample.
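For completeness, the ANLS numbers in the result JSON come from a computation like the one below. The per-sample field names ("answer", "gt_answers") are assumptions about my output format, not an exact schema:

```python
import json

def nls(pred: str, gt: str) -> float:
    """Normalized Levenshtein similarity, thresholded at 0.5 as in ANLS."""
    a, b = pred.strip().lower(), gt.strip().lower()
    if not a and not b:
        return 1.0
    # Classic single-row DP edit distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    dist = dp[len(b)]
    sim = 1.0 - dist / max(len(a), len(b), 1)
    return sim if sim > 0.5 else 0.0

def anls(path: str) -> float:
    """Average, over samples, of the best NLS against any ground-truth answer."""
    with open(path) as f:
        samples = json.load(f)                # e.g. "dataset_name".json
    scores = [max(nls(s["answer"], gt) for gt in s["gt_answers"]) for s in samples]
    return sum(scores) / max(len(scores), 1)
```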
More information about the inference run
Model weights: LLaMA-13B, with the delta weights taken from Hugging Face. The folder name contains "v1" so that the llava_v1 conv template is selected.
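The "v1" in the folder name matters because the eval scripts pick the conversation template from the model name. Roughly, as a sketch mirroring the name-based heuristic in LLaVA's eval scripts (assuming the installed version keeps the same template keys):

```python
from llava.conversation import conv_templates

def pick_conv_mode(model_name: str) -> str:
    # Name-based heuristic: "v1" in the folder name selects the llava_v1 template.
    name = model_name.lower()
    if "v1" in name:
        return "llava_v1"
    if "mpt" in name:
        return "mpt"
    return "llava_v0"

conv = conv_templates[pick_conv_mode("LLaVAR-13b-v1")].copy()
```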
Code repositories (most recent versions):
MM OCR: repo link
LLaVA: repo link
My forked repos:
MM OCR: repo link
LLaVA: repo link
Command:
I have added 3 parameters to the MMOCR code: LLaVA_conv_template to select the correct conversation template, qs_template (see the templates mentioned above), and the temperature value. A sketch of these flags is shown below.
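Concretely, the added flags look roughly like this in my fork. The flag names are my own additions, not part of the upstream MMOCR CLI:

```python
import argparse

parser = argparse.ArgumentParser()
# Flags added in my fork of the MMOCR eval script (names specific to my fork).
parser.add_argument("--LLaVA_conv_template", type=str, default="llava_v1",
                    help="conversation template passed to LLaVA")
parser.add_argument("--qs_template", type=str,
                    choices=["question_first", "question_last"],
                    default="question_last",
                    help="where the image token goes in the prompt")
parser.add_argument("--temperature", type=float, default=0.2,
                    help="sampling temperature for generation")
args = parser.parse_args()
```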
System Details: OS: Ubuntu 20.04.6 LTS (Focal Fossa), GPU: NVIDIA A6000, CPU: AMD x86_64
Screenshots:
llavar-docvqa-stvqa.zip
Please let me know if you need more information.