We attempt to reproduce the experiments for fine-tuning LayoutLMv3 on DocVQA using both extractive and abstractive approaches.
I try to document every detail in this repository. Note that this is not the official LayoutLMv3 codebase.
Work In Progress
pip3 install -r requirements.txt
Some of the code in this repository is adapted from this docvqa repo, which works on "LayoutLMv1 for DocVQA".
Note that the test set from the docvqa repo does not come with ground-truth answers.
Place the `docvqa` folder under the `data` folder. Run the following command to create the Hugging Face dataset:
python3 -m preprocess.extract_spans
Then you will get a processed dataset called `docvqa_cached_extractive_all_lowercase_True_msr_True`.
For more details about the statistics after preprocessing, check out here.
The final statistics on the number of spans found are as follows:
Train (total / #found spans / #not found) | Validation (total / #found spans / #not found) | Test |
---|---|---|
39,643 / 36,759 / 2,704 | 5,349 / 4,950 / 399 | 5,188 |
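The "found / not found" counts come from matching each ground-truth answer against the OCR words of the document to obtain start/end indices for extractive training. The repository's exact matching logic lives in `preprocess.extract_spans`; the sketch below is only a minimal illustration of such span matching under a lowercase, exact-token-match assumption (mirroring the `lowercase_True` flag in the cache name), not the actual code:

```python
def find_answer_span(words, answer):
    """Return (start, end) word indices of the answer inside the OCR words,
    or None if the answer does not occur as a contiguous span."""
    # Lowercase both sides, mirroring the lowercase preprocessing option.
    words = [w.lower() for w in words]
    answer_tokens = answer.lower().split()
    n = len(answer_tokens)
    if n == 0:
        return None
    for i in range(len(words) - n + 1):
        if words[i:i + n] == answer_tokens:
            return i, i + n - 1
    return None

# Example: the answer appears as a contiguous run of OCR words.
ocr_words = ["Total", "amount", "due:", "$45.00"]
print(find_answer_span(ocr_words, "amount due:"))  # (1, 2)
```

Answers that OCR splits, re-orders, or transcribes differently fall into the "not found" bucket, which is one reason a few thousand training answers have no span.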
NOTE: The Microsoft READ API for OCR is not available. Please contact me if you want to use this dataset. (Thanks to @redthing1 for giving me access.)
Run `accelerate config` to configure your distributed training environment, then run the experiments with:
accelerate launch docvqa_main.py --use_generation=0
Set `use_generation` to 1 if you want to use the generation model.
My distributed training environment: 6 GPUs
Model | Preprocessing | OCR Engine | Validation ANLS | Test ANLS |
---|---|---|---|---|
LayoutLMv3-base | lowercase inputs | built-in | 68.5% | - |
LayoutLMv3-base | lowercase inputs | Microsoft READ API | 73.3% | 74.24% |
LayoutLMv3-base | original cased | Microsoft READ API | 72.7% | - |
LayoutLMv3-base + Bart Decoder | lowercase | Microsoft READ API | 72.5% | - |
LayoutLMv3-base + Roberta-base | lowercase | Microsoft READ API | 73.0% | - |
The performance is still far behind what is reported in the paper.
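The validation and test columns above report ANLS (Average Normalized Levenshtein Similarity), the standard DocVQA metric. As a reference, here is a minimal self-contained sketch of the metric with the benchmark's usual threshold of 0.5; it is an illustration of the published metric definition, not the evaluation code used in this repository:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction, gold_answers, tau=0.5):
    """Score one question: the best normalized similarity over all gold
    answers, zeroed out when it falls below the threshold tau."""
    best = 0.0
    for gold in gold_answers:
        p, g = prediction.lower().strip(), gold.lower().strip()
        longest = max(len(p), len(g))
        sim = 1.0 if longest == 0 else 1.0 - levenshtein(p, g) / longest
        best = max(best, sim if sim >= tau else 0.0)
    return best
```

The dataset-level score is simply the mean of `anls` over all questions, which is why near-miss OCR transcriptions (similarity just above 0.5) still earn partial credit.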
Note: Adding a sliding window currently gives performance of around 64%, so it seems harmful to do so.