MichaelHypS opened this issue 3 months ago
Hi, @MichaelHypS. Thanks for your attention. I am very sorry that, due to computational resource limits, I have no plan to re-train the downstream VQA model, which was trained on 8 V100 (32G) GPUs. Here, I provide some instructions on how to fine-tune the VQA model.
Fine-tuning Code. I have provided the code to fine-tune ViTLP. Please refer to https://github.com/Veason-silverbullet/ViTLP/blob/main/finetuning/finetune.py and #1. I think you can easily adapt this code to fine-tune ViTLP on VQA datasets.
Dataset Arrangement. Given the fine-tuning code above, you then need to prepare the VQA dataset. I have provided the downstream OCR dataset preprocessing pipeline at https://github.com/Veason-silverbullet/ViTLP/blob/main/finetuning/preprocess_data.py. The VQA dataset preprocessing pipeline should be similar to it. Btw, please note that you need to augment the tokenizer and the ViTLP word embedding with a special \<VQA> token.
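Something along these lines should work (a rough sketch, assuming a HuggingFace BART-style tokenizer; the checkpoint path and the embedding-resize step are illustrative and depend on how you load ViTLP):

```python
from transformers import BartTokenizer

# Load the ViTLP tokenizer (BART BPE-based); the path is illustrative
tokenizer = BartTokenizer.from_pretrained('ckpts/ViTLP-medium')

# Register <VQA> as a special token so it is encoded as a single id
num_added = tokenizer.add_special_tokens({'additional_special_tokens': ['<VQA>']})
print(num_added, tokenizer.convert_tokens_to_ids('<VQA>'))

# Then grow the ViTLP decoder word embedding to match the new vocabulary size,
# e.g. via a resize_token_embeddings-style call on the loaded model
# (the exact call depends on how the ViTLP model is instantiated):
# model.resize_token_embeddings(len(tokenizer))
```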
Another tricky part of dataset preparation is writing code to obtain the answer bounding boxes with heuristic rules, based on 1) the ground-truth answers and 2) the document OCR results, which may take some effort.
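For instance, a very rough matching heuristic (illustrative only; in practice you will want fuzzier matching and text normalization) could scan the OCR words for a consecutive span matching the answer and collect their boxes:

```python
def find_answer_bboxes(answer, ocr_words, ocr_bboxes):
    # answer: ground-truth answer string
    # ocr_words: list of OCR'd words; ocr_bboxes: their [x1, y1, x2, y2] boxes
    answer_tokens = answer.lower().split()
    n = len(answer_tokens)
    for i in range(len(ocr_words) - n + 1):
        window = [w.lower().strip('.,:;') for w in ocr_words[i:i + n]]
        if window == answer_tokens:
            return ocr_bboxes[i:i + n]  # boxes of the matched answer span
    return None  # no exact match; fall back to fuzzier matching
```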
@Veason-silverbullet Thanks for the answer! Funny, finding the answer bounding boxes seemed like the part I was least afraid of :) But I can imagine it will take me more time than anticipated.
Anyway, I decided to give it a shot and create my own VQA preprocessing script. As I like to understand the basics before adding any extra complexity, may I ask if the following seems alright to you so far?
I started by creating a small JSON file with two questions from your GPT-4 example, such as:
```json
[
    {
        "question": "<VQA> What's the title?",
        "word": "GPT-4 Technical Report",
        "bbox": [328, 49, 681, 81]
    },
    {
        "question": "<VQA> Who is the author(s)?",
        "word": "OpenAI*",
        "bbox": [470, 143, 545, 161]
    }
]
```
Then, I tried to mimic your preprocessing script by concatenating the questions and their respective answers (word) after adding the special "\<VQA>" token (token id: 50267). This leads to the following:
"\<VQA> What's the title? GPT-4 Technical Report" -> [50267, 653, 18, 5, 1270, 116, 272, 10311, 12, 306, 12920, 2872]
and
"\<VQA> Who is the author(s)? OpenAI*" -> [50267, 3394, 16, 5, 2730, 1640, 29, 26610, 2117, 15238, 3226]
Once this is done, I simply flatten each question-answer pair by adding, for each, the \<start_token_id> (2), the \<locate_token_id> (50265) and the \<eos> (2), resulting in:
[2, 50267, 653, ... ,2872, 50265, 2, 2, 50267, 3394, 16, ... , 15238, 3226, 50265, 2, 1, 1, ... , 1]
or in other words:
[\<start> \<sequence0> \<locate> \<eos> \<start> \<sequence1> \<locate> \<eos> \<padding>]
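In code, what I am doing is roughly the following (just a sketch; `tokenizer` is the BART BPE tokenizer with the added \<VQA> token, and MAX_LEN is an illustrative padding length):

```python
BOS_ID, EOS_ID = 2, 2   # decoder start / end-of-sequence ids
LOCATE_ID = 50265       # <locate>
PAD_ID = 1              # padding
MAX_LEN = 1024          # illustrative maximum sequence length

def build_sequence(qa_pairs, tokenizer):
    # qa_pairs: list of ("<VQA> question", "answer") tuples
    ids = []
    for question, answer in qa_pairs:
        qa_ids = tokenizer.encode(question + ' ' + answer, add_special_tokens=False)
        ids += [BOS_ID] + qa_ids + [LOCATE_ID] + [EOS_ID]
    ids += [PAD_ID] * (MAX_LEN - len(ids))  # right-pad to MAX_LEN
    return ids
```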
Does this seem correct to you? Specifically: is the [\<locate> \<eos> \<start>] sequence correct (the [2, 2] feels a bit funny to me)? Is the bbox put in only once, at the \<locate> position, with a "locate_token_type"? And what would you recommend if the answer lies on two lines, e.g. at the end of one line and the beginning of the next?
Finally, perhaps a bit of a silly question, but I saw that you have a "ViTLPForDocVQA" model. Should I use this one instead of "ViTLPForPreTraining" in your finetune.py script? Could you perhaps say something about their differences please?
@MichaelHypS I appreciate you checking the code carefully. TL;DR, your understanding is basically right. Based on my experience, any format of VQA fine-tuning sequence is OK as long as the fine-tuning and inference sequence patterns are consistent.
For each of your questions, here are my suggestions:
> is the sequence [\<locate> \<eos> \<start>] correct? I feel funny about the [2, 2]
My implementation is [\<decoder_start_token_id> \<question_sequence> \<eos> \<answer_sequence> \<locate> \<eos> \<padding>]. Of course, the implementation you mentioned is OK too (just keep the format consistent at inference). The bos_token (`DECODER_START_TOKEN_ID = 2`) is a legacy inherited from the BART/T5 BPE tokenizer.
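Concretely, my suggested format corresponds to something like this (illustrative sketch, using the same token ids you listed):

```python
DECODER_START_TOKEN_ID = 2  # legacy bos from the BART/T5 tokenizer
EOS_ID = 2
LOCATE_ID = 50265
PAD_ID = 1

def build_vqa_sequence(question_ids, answer_ids, max_len):
    # [<decoder_start> <question> <eos> <answer> <locate> <eos> <padding>]
    ids = ([DECODER_START_TOKEN_ID] + question_ids + [EOS_ID]
           + answer_ids + [LOCATE_ID] + [EOS_ID])
    return ids + [PAD_ID] * (max_len - len(ids))
```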
> the bbox is put only once at the \<locate> position as well with a "locate_token_type"?
Yes.
> what would you recommend to do if, for example, the answer lies on two lines? Like at the end of one line and the beginning of the next one.
Thanks for your accurate understanding and question. For such cases, I treat the whole area covered by these two lines as the answer bounding box.
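In other words, something like taking the union of the line boxes (sketch, assuming [x1, y1, x2, y2] boxes):

```python
def merge_line_bboxes(line_bboxes):
    # Treat the whole area covered by the answer lines as one answer box
    x1 = min(b[0] for b in line_bboxes)
    y1 = min(b[1] for b in line_bboxes)
    x2 = max(b[2] for b in line_bboxes)
    y2 = max(b[3] for b in line_bboxes)
    return [x1, y1, x2, y2]
```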
> I saw that you have a "ViTLPForDocVQA" model, should I use this one instead of the "ViTLPForPreTraining" in your finetune.py script
Yes, use `ViTLPForDocVQA` instead of `ViTLPForPreTraining`. Also, please refer to the data loader at https://github.com/Veason-silverbullet/ViTLP/blob/main/dataset/docvqa.py.
I really appreciate your effort in checking the code. I have decided to rewrite the VQA data preprocessing and fine-tuning code. However, since I am swamped on weekdays, I plan to do this over the next two weekends. Please stay tuned.
@Veason-silverbullet I also appreciate your answers :) Thanks a lot!
So I made the changes to follow your sequence format. I now understand better why the first DECODER_START_TOKEN_ID = 2, thanks. I am now looking into your "DocVQATrainDataset" and have one small question about it: is the \<locate_token_id> part of the labels within the "qa_span"? To illustrate the question, let's take my former example again (with your edits). I have:
"\<VQA> What's the title? GPT-4 Technical Report" -> [50267, 653, 18, 5, 1270, 116] + [2] + [272, 10311, 12, 306, 12920, 2872] + [50265] + [2]
My current understanding from looking at your code is that you construct a "qa_span" that would resemble the following:
[1, ..., 1] + [0] + [2, ...,2] + [0] + [0]
Where we would now set the question with the normal "word_token_type" and the answer with the "answer_span_type", similarly to the OCR training where the bboxes also have a special "localization_token_type". But I'm not sure whether the \<locate_token_id> should be part of it or not. My intuition would actually have been to put the entire answer, all the way to the \<eos> token, within the label, such as:
[1, ..., 1] + [0] + [2, ...,2] + [2] + [2]
But then this would be inconsistent with the OCR training script. I am asking because I have created a "qa_span" array for simplicity, but looking at your code, it seems that we could perhaps use your "token_type" array to encode every task together, such that my final array could resemble:
[1, ..., 1] + [0] + [3, ...,3] + [2] + [0]
if we keep the \<locate_token_type> = 2 and set a new "\<answer_token_type>" = 3 for example.
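To make that concrete, here is roughly what I have in mind (the token-type names and values are just my own illustration, not taken from your repo):

```python
# Hypothetical token-type ids (my own naming, for illustration only):
WORD_TOKEN_TYPE = 1      # question tokens
PAD_TOKEN_TYPE = 0       # special / padding positions
LOCATE_TOKEN_TYPE = 2    # the <locate> position
ANSWER_TOKEN_TYPE = 3    # answer tokens (my proposed extra type)

def build_token_types(num_question_tokens, num_answer_tokens):
    # Mirrors the layout [<question> <sep> <answer> <locate> <eos>]
    return ([WORD_TOKEN_TYPE] * num_question_tokens
            + [PAD_TOKEN_TYPE]                        # separator after the question
            + [ANSWER_TOKEN_TYPE] * num_answer_tokens
            + [LOCATE_TOKEN_TYPE]                     # <locate>
            + [PAD_TOKEN_TYPE])                       # final <eos>
```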
Since you're swamped, once I have something running I could also share my code. Note that I made quite a few edits to yours, such as using a YAML file directly instead of argparse, so it's not entirely straightforward to drop back into your implementation.
@MichaelHypS I have prepared the DocVQA fine-tuning code at https://github.com/Veason-silverbullet/ViTLP/tree/main/finetuning.
- For the answer bounding boxes, check the metadata file at DocVQA/train-metadata.json.
- For preparing the input sequence format (as we discussed earlier), please refer to preprocess_docvqa_data.py.
- For the DocVQA fine-tuning code, please refer to finetune_docvqa.py.
After fine-tuning, you may need to prepare the inference code, or I will provide it next weekend.
Thanks a lot for the scripts! I see a couple of differences from what I tried, most notably that you are not concatenating all the QA pairs together, and that you also split the answer per bbox and token.
Hi @Veason-silverbullet, thank you for sharing your work! It's a really cool model. Do you have any plan to share the DocVQA model's weight file? I look forward to the inference code too! :)
@MichaelHypS @SongDoHou , the DocVQA inference code is updated at https://github.com/Veason-silverbullet/ViTLP/blob/main/finetuning/inference_docvqa.py.
@SongDoHou Due to the company's policy, only the base model can be open-sourced. I have no plans to share the DocVQA model's weights for the time being, as I am struggling with my work and have no GPU resources to fine-tune the DocVQA model at the university. Nevertheless, I may rent GPUs to do it next month if I am free at that time.
Thanks a lot! It's nice to be able to compare your code with what I did.
I have a very small note about your code at line 196. I would suggest changing it to:
```python
bboxes = bboxes.tolist() if bboxes is not None else []
```
Just so that the code doesn't break if the first generated token is the EOS, which actually happened to me with my small dummy training set.
I do have yet another question. I tried to train my model (with the 2 extra tokens added to the dictionary size, as you mentioned earlier) as well as your pipeline on my small dummy dataset, to see if I can simply overfit it (2 images with 2 questions each) over 1000 iterations. This is a simple sanity check to make sure that everything learns correctly. I noticed that both codes lead to an lm_loss that starts at around 11, then goes down and hovers around 2.6. I also tried to look at the generated tokens during training, but honestly I can't tell much, other than that both models seem to generate only locate_id tokens... Anyway, after training, my pipeline produces a single locate_token followed by the EOS, while yours produces a single letter and then also the locate_token followed by the EOS. Therefore I am not that confident that the model has learned correctly. Do you have any input on this behavior, and on how I could make sure that my training works?
@MichaelHypS Thanks for your attention. I will provide the training logs and checkpoints for your reference next weekend.
@MichaelHypS I've tested the following DocVQA fine-tuning script:
```bash
# Step 1: Clone/pull the latest code (updated on 01/09/2024)
git clone https://github.com/Veason-silverbullet/ViTLP.git
cd ViTLP
mkdir -p ckpts/ViTLP-medium
git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium

# Step 2: Manually download DocVQA document images from https://rrc.cvc.uab.es/?ch=17&com=downloads
cd finetuning
# Download and extract DocVQA document images into ./DocVQA/documents from https://rrc.cvc.uab.es/?ch=17&com=downloads
ls ./DocVQA  # The `documents` should be located at `./DocVQA`
# bboxes-train-80.npy  images.txt  qa_span_types-train-80.npy  token_types-train-80.npy  train-mapping.txt  train_v1.0_withQT.json
# documents  link.py  test_v1.0.json  tokens-train-80.npy  train-metadata.json  val_v1.0_withQT.json

# Step 3: Fine-tuning DocVQA
# Effective batch size = num_nodes * num_gpus * batch_size * gradient_accumulation_steps
# Make the `Effective batch size` 128 by setting `gradient_accumulation_steps` in `./misc/zero1_fp16-grad_acc-16.json` depending on your computation resources
# Since I only have 4 Nvidia-3090 (24G), I have to set gradient_accumulation_steps = 16.
nohup deepspeed --num_nodes 1 --num_gpus 4 finetune_docvqa.py --batch_size=2 --deepspeed_config=misc/zero1_fp16-grad_acc-16.json --output_dir=DocVQA-outputs > nohup.out 2>&1 &
```
Since I only have 4 Nvidia-3090 (24G) at hand, the fine-tuning takes ~6 days. I can only release the full training logs and checkpoints next week. Ideally, if 8 A100s are available, the fine-tuning can be done in hours by setting `gradient_accumulation_steps = 1`.
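For example, with the command above, the effective batch size works out to 1 node × 4 GPUs × batch_size 2 × gradient_accumulation_steps 16 = 128.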
Thanks a lot for your continuous support, really appreciated. I will try to run your script as well as a sanity check, and then mine. My setup is rather similar to yours, so it may take a bit of time until you hear from me as well :)
@MichaelHypS The DocVQA checkpoint is available at https://drive.google.com/drive/folders/1zZNw76DQTBPBv4Uuw-Bvuba_poYqc8ZK?usp=drive_link. Please feel free to have a shot.
Also, we have some important updates. Please pull the latest commit (10/09/24). The updates include:
- Increased fine-tuning resolution compared to pre-training, which is key to DocVQA performance.
- Updated fine-tuning data. Previously, some fine-tuning data was missing because the heuristic-rule code finetuning/DocVQA/link.py could not link some boxes to answers. I updated link.py and the fine-tuning data last week.
Please check the latest README for the DocVQA inference instructions. The checkpoint provided above is only for running finetuning/inference_docvqa.py. Since it was fine-tuned with the old data, its performance might be a little inferior. I will fine-tune it with the new data this week (and put the official checkpoint on HuggingFace later).
@MichaelHypS, as requested, the training loss curve is below.
The training log is also provided: log.zip
Thanks a lot for the amazing work!
Hi, great work and thanks for sharing the code and weights!
I tried the OCR on your sample and it works well. However, may I ask how we could perform VQA? Something similar to your paper example: "\<VQA> What's the title?". Could you perhaps give us a snippet for this please?