MichaelHypS opened this issue 3 months ago
Hi, @MichaelHypS. Thanks for your attention. I am very sorry that, due to computational resource limits, I have no plan to re-train the downstream VQA model, which was trained on 8 V100 (32G) GPUs. Here, I provide some instructions on how to fine-tune the VQA model.
Fine-tuning Code. I have provided the code to fine-tune ViTLP. Please refer to https://github.com/Veason-silverbullet/ViTLP/blob/main/finetuning/finetune.py and #1. I think you can easily adapt this code to fine-tune ViTLP on VQA datasets.
Dataset Arrangement. Given the fine-tuning code above, you then need to prepare the VQA dataset. I have provided the downstream OCR dataset preprocessing pipeline at https://github.com/Veason-silverbullet/ViTLP/blob/main/finetuning/preprocess_data.py. The VQA dataset preprocessing pipeline should be similar to it. Btw, please note that you need to augment the tokenizer and the ViTLP word embedding with a special \<VQA> token.
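Something along these lines should work (a rough sketch, assuming a HuggingFace BART-style tokenizer; the checkpoint path and the embedding-resize step are illustrative and depend on how you load ViTLP):

```python
from transformers import BartTokenizer

# Load the ViTLP tokenizer (BART BPE-based); the path is illustrative
tokenizer = BartTokenizer.from_pretrained('ckpts/ViTLP-medium')

# Register <VQA> as a special token so it is encoded as a single id
num_added = tokenizer.add_special_tokens({'additional_special_tokens': ['<VQA>']})
print(num_added, tokenizer.convert_tokens_to_ids('<VQA>'))

# Then grow the ViTLP decoder word embedding to match the new vocabulary size,
# e.g. via a resize_token_embeddings-style call on the loaded model
# (the exact call depends on how the ViTLP model is instantiated):
# model.resize_token_embeddings(len(tokenizer))
```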
Another tricky part of dataset preparation is writing code to obtain the answer bounding boxes with heuristic rules, based on 1) the ground-truth answers and 2) the document OCR results, which may take some effort.
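For instance, a very rough matching heuristic (illustrative only; in practice you will want fuzzier matching and text normalization) could scan the OCR words for a consecutive span matching the answer and collect their boxes:

```python
def find_answer_bboxes(answer, ocr_words, ocr_bboxes):
    # answer: ground-truth answer string
    # ocr_words: list of OCR'd words; ocr_bboxes: their [x1, y1, x2, y2] boxes
    answer_tokens = answer.lower().split()
    n = len(answer_tokens)
    for i in range(len(ocr_words) - n + 1):
        window = [w.lower().strip('.,:;') for w in ocr_words[i:i + n]]
        if window == answer_tokens:
            return ocr_bboxes[i:i + n]  # boxes of the matched answer span
    return None  # no exact match; fall back to fuzzier matching
```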
@Veason-silverbullet Thanks for the answer! Funny, finding the answer bounding boxes seemed like the part I was least afraid of :) But I can imagine it will take me more time than anticipated.
Anyway, I decided to give it a shot and create my own VQA preprocessing script. As I like to understand the basics before adding any extra complexity, may I ask if the following seems alright to you so far?
I started by creating a small JSON file with two questions from your GPT-4 example, such as:
```json
[
    {
        "question": "<VQA> What's the title?",
        "word": "GPT-4 Technical Report",
        "bbox": [328, 49, 681, 81]
    },
    {
        "question": "<VQA> Who is the author(s)?",
        "word": "OpenAI*",
        "bbox": [470, 143, 545, 161]
    }
]
```
Then, I tried to mimic your preprocessing script by concatenating the questions and their respective answers (word) after adding the special "\<VQA>" token (token id: 50267). This leads to the following:
"\<VQA> What's the title? GPT-4 Technical Report" -> [50267, 653, 18, 5, 1270, 116, 272, 10311, 12, 306, 12920, 2872]
and
"\<VQA> Who is the author(s)? OpenAI*" -> [50267, 3394, 16, 5, 2730, 1640, 29, 26610, 2117, 15238, 3226]
Once this is done, I simply flatten each question-answer pair by adding, for each, the \<start_token_id> (2), the \<locate_token_id> (50265) and the \<eos> (2), resulting in:
[2, 50267, 653, ... ,2872, 50265, 2, 2, 50267, 3394, 16, ... , 15238, 3226, 50265, 2, 1, 1, ... , 1]
or in other words:
[\<start> \<sequence0> \<locate> \<eos> \<start> \<sequence1> \<locate> \<eos> \<padding>]
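In code, what I am doing is roughly the following (just a sketch; `tokenizer` is the BART BPE tokenizer with the added \<VQA> token, and MAX_LEN is an illustrative padding length):

```python
BOS_ID, EOS_ID = 2, 2   # decoder start / end-of-sequence ids
LOCATE_ID = 50265       # <locate>
PAD_ID = 1              # padding
MAX_LEN = 1024          # illustrative maximum sequence length

def build_sequence(qa_pairs, tokenizer):
    # qa_pairs: list of ("<VQA> question", "answer") tuples
    ids = []
    for question, answer in qa_pairs:
        qa_ids = tokenizer.encode(question + ' ' + answer, add_special_tokens=False)
        ids += [BOS_ID] + qa_ids + [LOCATE_ID] + [EOS_ID]
    ids += [PAD_ID] * (MAX_LEN - len(ids))  # right-pad to MAX_LEN
    return ids
```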
Does this seem correct to you? Specifically: is the [\<locate> \<eos> \<start>] sequence correct (the [2, 2] feels a bit funny to me)? Is the bbox put in only once, at the \<locate> position, with a "locate_token_type"? And what would you recommend if the answer lies on two lines, e.g. at the end of one line and the beginning of the next?
Finally, perhaps a bit of a silly question, but I saw that you have a "ViTLPForDocVQA" model. Should I use this one instead of "ViTLPForPreTraining" in your finetune.py script? Could you perhaps say something about their differences please?
@MichaelHypS I appreciate you checking the code carefully. TL;DR, your understanding is basically right. Based on my experience, any format of VQA fine-tuning sequence is OK as long as the fine-tuning and inference sequence patterns are consistent.
For each of your questions, here are my suggestions:
> is the sequence [\<locate> \<eos> \<start>] correct? I feel funny about the [2, 2]
My implementation is [\<decoder_start_token_id> \<question_sequence> \<eos> \<answer_sequence> \<locate> \<eos> \<padding>]. Of course, the implementation you mentioned is OK too (just keep the format consistent at inference). The bos_token (`DECODER_START_TOKEN_ID = 2`) is a legacy inherited from the BART/T5 BPE tokenizer.
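Concretely, my suggested format corresponds to something like this (illustrative sketch, using the same token ids you listed):

```python
DECODER_START_TOKEN_ID = 2  # legacy bos from the BART/T5 tokenizer
EOS_ID = 2
LOCATE_ID = 50265
PAD_ID = 1

def build_vqa_sequence(question_ids, answer_ids, max_len):
    # [<decoder_start> <question> <eos> <answer> <locate> <eos> <padding>]
    ids = ([DECODER_START_TOKEN_ID] + question_ids + [EOS_ID]
           + answer_ids + [LOCATE_ID] + [EOS_ID])
    return ids + [PAD_ID] * (max_len - len(ids))
```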
> the bbox is put only once at the \<locate> position as well with a "locate_token_type"?
Yes.
> what would you recommend to do if, for example, the answer lies on two lines? Like at the end of one line and the beginning of the next one.
Thanks for your accurate understanding and question. For such cases, I treat the whole area covered by these two lines as the answer bounding box.
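In other words, something like taking the union of the line boxes (sketch, assuming [x1, y1, x2, y2] boxes):

```python
def merge_line_bboxes(line_bboxes):
    # Treat the whole area covered by the answer lines as one answer box
    x1 = min(b[0] for b in line_bboxes)
    y1 = min(b[1] for b in line_bboxes)
    x2 = max(b[2] for b in line_bboxes)
    y2 = max(b[3] for b in line_bboxes)
    return [x1, y1, x2, y2]
```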
> I saw that you have a "ViTLPForDocVQA" model, should I use this one instead of the "ViTLPForPreTraining" in your finetune.py script
Yes, use `ViTLPForDocVQA` instead of `ViTLPForPreTraining`. Also, please refer to the data loader at https://github.com/Veason-silverbullet/ViTLP/blob/main/dataset/docvqa.py.
I really appreciate your effort in checking the code. I have decided to rewrite the VQA data preprocessing and fine-tuning code. However, since I am swamped on weekdays, I plan to do this over the next two weekends. Please stay tuned.
@Veason-silverbullet I also appreciate your answers :) Thanks a lot!
So I made the changes to follow your sequence format. I now understand better why the first DECODER_START_TOKEN_ID = 2, thanks. I am now looking into your "DocVQATrainDataset" and have one small question about it: is the \<locate_token_id> part of the labels within the "qa_span"? To illustrate the question, let's take my former example again (with your edits). I have:
"\<VQA> What's the title? GPT-4 Technical Report" -> [50267, 653, 18, 5, 1270, 116] + [2] + [272, 10311, 12, 306, 12920, 2872] + [50265] + [2]
My current understanding from looking at your code is that you construct a "qa_span" that would resemble the following:
[1, ..., 1] + [0] + [2, ...,2] + [0] + [0]
Where we would now set the question with the normal "word_token_type" and the answer with the "answer_span_type", similarly to the OCR training where the bboxes also have a special "localization_token_type". But I'm not sure whether the \<locate_token_id> should be part of it or not. My intuition would actually have been to put the entire answer, all the way to the \<eos> token, within the label, such as:
[1, ..., 1] + [0] + [2, ...,2] + [2] + [2]
But then this would be inconsistent with the OCR training script. I am asking because I have created a "qa_span" array for simplicity, but looking at your code, it seems that we could perhaps use your "token_type" array to encode every task together, such that my final array could resemble:
[1, ..., 1] + [0] + [3, ...,3] + [2] + [0]
if we keep the \<locate_token_type> = 2 and set a new "\<answer_token_type>" = 3 for example.
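To make that concrete, here is roughly what I have in mind (the token-type names and values are just my own illustration, not taken from your repo):

```python
# Hypothetical token-type ids (my own naming, for illustration only):
WORD_TOKEN_TYPE = 1      # question tokens
PAD_TOKEN_TYPE = 0       # special / padding positions
LOCATE_TOKEN_TYPE = 2    # the <locate> position
ANSWER_TOKEN_TYPE = 3    # answer tokens (my proposed extra type)

def build_token_types(num_question_tokens, num_answer_tokens):
    # Mirrors the layout [<question> <sep> <answer> <locate> <eos>]
    return ([WORD_TOKEN_TYPE] * num_question_tokens
            + [PAD_TOKEN_TYPE]                        # separator after the question
            + [ANSWER_TOKEN_TYPE] * num_answer_tokens
            + [LOCATE_TOKEN_TYPE]                     # <locate>
            + [PAD_TOKEN_TYPE])                       # final <eos>
```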
Since you're swamped, once I have something running I could also share my code. Note that I made quite a few edits to yours, such as using a YAML file directly instead of argparse, so it's not entirely straightforward to drop back into your implementation.
@MichaelHypS I have prepared the DocVQA fine-tuning code at https://github.com/Veason-silverbullet/ViTLP/tree/main/finetuning.
- For the answer bounding boxes, check the metadata file at DocVQA/train-metadata.json.
- For preparing the input sequence format (as we discussed earlier), please refer to preprocess_docvqa_data.py.
- For the DocVQA fine-tuning code, please refer to finetune_docvqa.py.
After fine-tuning, you may need to prepare the inference code, or I will provide it next weekend.
Thanks a lot for the scripts! I see a couple of differences from what I tried, most notably that you are not concatenating all the QA pairs together, and that you also split the answer per bbox and token.
Hi @Veason-silverbullet, thank you for sharing your work! It's a really cool model. Do you have any plan to share the DocVQA model's weight file? I look forward to the inference code too! :)
@MichaelHypS @SongDoHou , the DocVQA inference code is updated at https://github.com/Veason-silverbullet/ViTLP/blob/main/finetuning/inference_docvqa.py.
@SongDoHou Due to the company's policy, only the base model can be open-sourced. I have no plans to share the DocVQA model's weights for the time being, as I am struggling with my work and have no GPU resources to fine-tune the DocVQA model at the university. Nevertheless, I may rent GPUs to do it next month if I am free at that time.
Thanks a lot! It's nice to be able to compare your code with what I did.
I have a very small note about your code at line 196. I would suggest changing it to:
```python
bboxes = bboxes.tolist() if bboxes is not None else []
```
Just so that the code doesn't break if the first generated token is the EOS, which actually happened to me with my small dummy training set.
I do have yet another question. I tried to train my model (with the 2 extra tokens added to the dictionary size, as you mentioned earlier) as well as your pipeline on my small dummy dataset, to see if I can simply overfit it (2 images with 2 questions each) over 1000 iterations. This is a simple sanity check to make sure that everything learns correctly. I noticed that both codes lead to an lm_loss that starts at around 11, then goes down and hovers around 2.6. I also tried to look at the generated tokens during training, but honestly I can't tell much, other than that both models seem to generate only locate_id tokens... Anyway, after training, my pipeline produces a single locate_token followed by the EOS, while yours produces a single letter and then also the locate_token followed by the EOS. Therefore I am not that confident that the model has learned correctly. Do you have any input on this behavior, and on how I could make sure that my training works?
@MichaelHypS Thanks for your attention. I will provide the training logs and checkpoints for your reference next weekend.
@MichaelHypS I've tested the following DocVQA fine-tuning script:
```bash
# Step 1: Clone/pull the latest code (updated on 01/09/2024)
git clone https://github.com/Veason-silverbullet/ViTLP.git
cd ViTLP
mkdir -p ckpts/ViTLP-medium
git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium

# Step 2: Manually download DocVQA document images from https://rrc.cvc.uab.es/?ch=17&com=downloads
cd finetuning
# Download and extract DocVQA document images into ./DocVQA/documents from https://rrc.cvc.uab.es/?ch=17&com=downloads
ls ./DocVQA  # The `documents` should be located at `./DocVQA`
# bboxes-train-80.npy  images.txt  qa_span_types-train-80.npy  token_types-train-80.npy  train-mapping.txt  train_v1.0_withQT.json
# documents  link.py  test_v1.0.json  tokens-train-80.npy  train-metadata.json  val_v1.0_withQT.json

# Step 3: Fine-tuning DocVQA
# Effective batch size = num_nodes * num_gpus * batch_size * gradient_accumulation_steps
# Make the `Effective batch size` 128 by setting `gradient_accumulation_steps` in `./misc/zero1_fp16-grad_acc-16.json` depending on your computation resources
# Since I only have 4 Nvidia-3090 (24G), I have to set gradient_accumulation_steps = 16.
nohup deepspeed --num_nodes 1 --num_gpus 4 finetune_docvqa.py --batch_size=2 --deepspeed_config=misc/zero1_fp16-grad_acc-16.json --output_dir=DocVQA-outputs > nohup.out 2>&1 &
```
Since I only have 4 Nvidia-3090 (24G) at hand, the fine-tuning takes ~6 days. I can only release the full training logs and checkpoints next week. Ideally, if 8 A100s are available, the fine-tuning can be done in hours by setting `gradient_accumulation_steps = 1`.
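For example, with the command above, the effective batch size works out to 1 node × 4 GPUs × batch_size 2 × gradient_accumulation_steps 16 = 128.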
Thanks a lot for your continuous support, really appreciated. I will try to run your script as well as a sanity check, and then mine. My setup is rather similar to yours, so it may take a bit of time until you hear from me as well :)
@MichaelHypS The DocVQA checkpoint is available at https://drive.google.com/drive/folders/1zZNw76DQTBPBv4Uuw-Bvuba_poYqc8ZK?usp=drive_link. Please feel free to have a shot.
Also, we have some important updates. Please pull the latest commit (10/09/24). The updates include:
- Increased fine-tuning resolution compared to pre-training, which is key to DocVQA performance.
- Updated fine-tuning data. Previously, some fine-tuning data was missing because the heuristic-rule code finetuning/DocVQA/link.py could not link some boxes to answers. I updated link.py and the fine-tuning data last week.
Please check the latest README for the DocVQA inference instructions. The checkpoint provided above is only for running finetuning/inference_docvqa.py. Since it was fine-tuned with the old data, its performance might be a little inferior. I will fine-tune it with the new data this week (and put the official checkpoint on HuggingFace later).
@MichaelHypS, as requested, the training loss curve is below.
The training log is also provided: log.zip
Thanks a lot for the amazing work!
Hi, great work and thanks for sharing the code and weights!
I tried the OCR on your sample and it works well. However, may I ask how we could perform VQA? Something similar to your paper example: "\<VQA> What's the title?". Could you perhaps give us a snippet for this please?