Hi Rohan:
As mentioned in the MM OCR paper, we use accuracy, not ANLS, to evaluate STVQA and DocVQA.
The script we used for MM OCR is here.
I have uploaded the answer JSON files for the four datasets to the LLaVAR/MultimodalOCR folder.
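For reference, an accuracy check in this style looks roughly like the following. This is only a minimal sketch of a relaxed containment match; the exact normalization and matching rules in the official MM OCR script may differ, and the field names in the sample dicts are assumptions.

```python
def is_correct(prediction: str, gt_answers: list[str]) -> bool:
    """Relaxed accuracy: a sample counts as correct if any ground-truth
    answer appears as a substring of the prediction (case-insensitive)."""
    pred = prediction.strip().lower()
    return any(ans.strip().lower() in pred for ans in gt_answers)

def accuracy(samples: list[dict]) -> float:
    """samples: [{"prediction": str, "answers": [str, ...]}, ...]  (assumed schema)"""
    hits = sum(is_correct(s["prediction"], s["answers"]) for s in samples)
    return hits / max(len(samples), 1)
```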
Hi @StevenyzZhang, I tried using the code and method you shared, but there is a runtime issue with the new version of LLaVA. Specifically, the following line throws an error. I replaced it with the line below, which leads to another error further down, and that in turn is resolved by loading the model with the new method shown. I believe the changes in the original LLaVA repo may no longer be compatible with what you shared.
vision_tower = model.model.vision_tower[0] # original
replaced with
vision_tower = model.get_vision_tower() # no zero index
New method for loading the model, tokenizer, and image processor, per the LLaVA repo update:
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, model_base, model_name)
This function internally loads the vision tower (see run_llava.py in the LLaVA/llava/eval folder of the repo).
The issue with the shared code is that only the vision config gets initialized; the vision tower weights are never actually loaded. As a result, when the generate function is triggered and images need to be passed through the vision tower, it fails.
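For context, the loading path I ended up with against the current LLaVA repo looks roughly like this. Paths and the checkpoint name are placeholders, and the explicit is_loaded check just mirrors what the builder normally does internally, so treat this as a sketch rather than the exact setup used for the paper:

```python
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "/path/to/LLaVAR-13b-v1"   # placeholder checkpoint path
model_base = None                        # base LLaMA path if loading delta/LoRA weights
model_name = get_model_name_from_path(model_path)

# New-style loading: tokenizer, model, and image processor come from one builder call.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, model_base, model_name
)

# The vision tower is now accessed through a method instead of an indexed attribute.
vision_tower = model.get_vision_tower()   # replaces model.model.vision_tower[0]
if not vision_tower.is_loaded:            # make sure the CLIP weights are actually loaded
    vision_tower.load_model()
```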
Can you please reproduce your results with the latest LLaVA repo and the code you shared in your comment? Thank you so much!!
Hi Rohan:
The shared code should work well with the LLaVA code in this repo.
@zhangry868 @StevenyzZhang
Describe the issue
Can you please share a JSON file with per-sample results and the overall accuracy for DocVQA and STVQA, along with the inference parameters used for the scores reported in the paper?
I am attaching a screenshot of my reproduced results on DocVQA and STVQA. Two parameters that differ between the LLaVA repo and the MMOCR inference code have been varied:
Temperature: 0.2 or 0.9
QS template: question first or question last (this uses IMAGE_TOKEN rather than IMAGE_PATCH_TOKEN repeated seq_len times, but the two are equivalent; check the LLaVA code for reference). See the sketch after this list for the two variants.
The decoding length is fixed to an upper limit of 256 tokens, as used in MMOCR.
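A sketch of the two QS template variants I used. The token and template names come from llava.constants and llava.conversation; whether <im_start>/<im_end> wrapping is also needed depends on the checkpoint's mm_use_im_start_end setting, so this is an assumption rather than the exact prompt-building code:

```python
from llava.constants import DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

def build_prompt(question: str, question_first: bool = True) -> str:
    # Variant A: question first, image token appended.
    # Variant B: image token first, question appended (the LLaVA default ordering).
    qs = (question + "\n" + DEFAULT_IMAGE_TOKEN) if question_first \
         else (DEFAULT_IMAGE_TOKEN + "\n" + question)
    conv = conv_templates["llava_v1"].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    return conv.get_prompt()
```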
I am also attaching a zip file containing LLaVAR results on the two datasets with 4 different inference runs per dataset, i.e., 8 runs in total. Each run is a folder containing a result JSON (the ANLS score as a decimal) and "dataset_name".json with the response for each sample.
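For completeness, the ANLS numbers in the result JSON come from a computation like the one below. The per-sample field names ("answer", "gt_answers") are assumptions about my output format, not an exact schema:

```python
import json

def nls(pred: str, gt: str) -> float:
    """Normalized Levenshtein similarity, thresholded at 0.5 as in ANLS."""
    a, b = pred.strip().lower(), gt.strip().lower()
    if not a and not b:
        return 1.0
    # Classic single-row DP edit distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    dist = dp[len(b)]
    sim = 1.0 - dist / max(len(a), len(b), 1)
    return sim if sim > 0.5 else 0.0

def anls(path: str) -> float:
    """Average, over samples, of the best NLS against any ground-truth answer."""
    with open(path) as f:
        samples = json.load(f)                # e.g. "dataset_name".json
    scores = [max(nls(s["answer"], gt) for gt in s["gt_answers"]) for s in samples]
    return sum(scores) / max(len(scores), 1)
```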
More information about the inference run
Model weights: LLaMA-13B, with the delta weights taken from Hugging Face. The folder name contains "v1" so that the llava_v1 conv template is selected.
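The "v1" in the folder name matters because the eval scripts pick the conversation template from the model name. Roughly, as a sketch mirroring the name-based heuristic in LLaVA's eval scripts (assuming the installed version keeps the same template keys):

```python
from llava.conversation import conv_templates

def pick_conv_mode(model_name: str) -> str:
    # Name-based heuristic: "v1" in the folder name selects the llava_v1 template.
    name = model_name.lower()
    if "v1" in name:
        return "llava_v1"
    if "mpt" in name:
        return "mpt"
    return "llava_v0"

conv = conv_templates[pick_conv_mode("LLaVAR-13b-v1")].copy()
```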
Code repositories (most recent versions):
MM OCR: repo link
LLaVA: repo link
My forked repos:
MM OCR: repo link
LLaVA: repo link
Command:
I have added 3 parameters to the MMOCR code: LLaVA_conv_template to select the correct conversation template, qs_template (see the templates mentioned above), and the temperature value. A sketch of these flags is shown below.
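Concretely, the added flags look roughly like this in my fork. The flag names are my own additions, not part of the upstream MMOCR CLI:

```python
import argparse

parser = argparse.ArgumentParser()
# Flags added in my fork of the MMOCR eval script (names specific to my fork).
parser.add_argument("--LLaVA_conv_template", type=str, default="llava_v1",
                    help="conversation template passed to LLaVA")
parser.add_argument("--qs_template", type=str,
                    choices=["question_first", "question_last"],
                    default="question_last",
                    help="where the image token goes in the prompt")
parser.add_argument("--temperature", type=float, default=0.2,
                    help="sampling temperature for generation")
args = parser.parse_args()
```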
System Details: OS: Ubuntu 20.04.6 LTS (Focal Fossa), GPU: NVIDIA A6000, CPU: AMD x86_64
Screenshots:
llavar-docvqa-stvqa.zip
Please let me know if you need more information.