We attempt to reproduce the experiments for fine-tuning LayoutLMv3 on DocVQA using both extractive and abstractive approaches.
I try to document every detail in this repository. Note that this is not the official LayoutLMv3 codebase.
Work In Progress
pip3 install -r requirements.txt
Some of the code in this repository is adapted from this docvqa repo, which works on "LayoutLMv1 for DocVQA".
Note that the test set from the docvqa repo does not come with ground-truth answers.
Place the `docvqa` folder under the `data` folder. Run the following command to create the Hugging Face dataset:
python3 -m preprocess.extract_spans
Then you will get a processed dataset called `docvqa_cached_extractive_all_lowercase_True_msr_True`.
For more details about the statistics after preprocessing, check out here.
The final statistics on the number of spans found are as follows:
Train (total / #found spans / #not found) | Validation (total / #found spans / #not found) | Test |
---|---|---|
39,643 / 36,759 / 2,704 | 5,349 / 4,950 / 399 | 5,188 |
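The "found / not found" counts come from matching each ground-truth answer against the OCR words of the document to obtain start/end indices for extractive training. The repository's exact matching logic lives in `preprocess.extract_spans`; the sketch below is only a minimal illustration of such span matching under a lowercase, exact-token-match assumption (mirroring the `lowercase_True` flag in the cache name), not the actual code:

```python
def find_answer_span(words, answer):
    """Return (start, end) word indices of the answer inside the OCR words,
    or None if the answer does not occur as a contiguous span."""
    # Lowercase both sides, mirroring the lowercase preprocessing option.
    words = [w.lower() for w in words]
    answer_tokens = answer.lower().split()
    n = len(answer_tokens)
    if n == 0:
        return None
    for i in range(len(words) - n + 1):
        if words[i:i + n] == answer_tokens:
            return i, i + n - 1
    return None

# Example: the answer appears as a contiguous run of OCR words.
ocr_words = ["Total", "amount", "due:", "$45.00"]
print(find_answer_span(ocr_words, "amount due:"))  # (1, 2)
```

Answers that OCR splits, re-orders, or transcribes differently fall into the "not found" bucket, which is one reason a few thousand training answers have no span.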
NOTE: The Microsoft READ API for OCR is not available. Please contact me if you want to use this dataset. (Thanks to @redthing1 for giving me access.)
Run `accelerate config` to configure your distributed training environment, then run the experiments with:
accelerate launch docvqa_main.py --use_generation=0
Set `use_generation` to 1 if you want to use the generation model.
My distributed training environment: 6 GPUs
Model | Preprocessing | OCR Engine | Validation ANLS | Test ANLS |
---|---|---|---|---|
LayoutLMv3-base | lowercase inputs | built-in | 68.5% | - |
LayoutLMv3-base | lowercase inputs | Microsoft READ API | 73.3% | 74.24% |
LayoutLMv3-base | original cased | Microsoft READ API | 72.7% | - |
LayoutLMv3-base + Bart Decoder | lowercase | Microsoft READ API | 72.5% | - |
LayoutLMv3-base + Roberta-base | lowercase | Microsoft READ API | 73.0% | - |
The performance is still far behind what is reported in the paper.
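The validation and test columns above report ANLS (Average Normalized Levenshtein Similarity), the standard DocVQA metric. As a reference, here is a minimal self-contained sketch of the metric with the benchmark's usual threshold of 0.5; it is an illustration of the published metric definition, not the evaluation code used in this repository:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction, gold_answers, tau=0.5):
    """Score one question: the best normalized similarity over all gold
    answers, zeroed out when it falls below the threshold tau."""
    best = 0.0
    for gold in gold_answers:
        p, g = prediction.lower().strip(), gold.lower().strip()
        longest = max(len(p), len(g))
        sim = 1.0 if longest == 0 else 1.0 - levenshtein(p, g) / longest
        best = max(best, sim if sim >= tau else 0.0)
    return best
```

The dataset-level score is simply the mean of `anls` over all questions, which is why near-miss OCR transcriptions (similarity just above 0.5) still earn partial credit.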
Note: Adding a sliding window currently gives performance of around 64%, so it seems harmful to do so.