clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Bounding boxes required for pretraining? #277

Open mustaszewski opened 6 months ago

mustaszewski commented 6 months ago

Does the pre-training of Donut require bounding boxes for individual words? The synthetically generated SynthDoG dataset (https://huggingface.co/datasets/naver-clova-ix/synthdog-en), which was also used for Donut pretraining, contains no bounding boxes, so I assume the visual corpus described in the paper also lacks bounding box coordinates.
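
For reference, a quick way to check is to stream one sample from the Hugging Face dataset and look at its ground truth. This is just a sketch: the `image`/`ground_truth` field names follow the dataset card, and the `validation` split name is an assumption.

```python
import json
from datasets import load_dataset

# Stream a single sample so the full dataset is not downloaded.
# Split name and field names are assumptions based on the dataset card.
ds = load_dataset("naver-clova-ix/synthdog-en", split="validation", streaming=True)
sample = next(iter(ds))

gt = json.loads(sample["ground_truth"])
print(gt.keys())        # expect something like dict_keys(['gt_parse'])
print(gt["gt_parse"])   # reading-order text only, no bounding box coordinates
```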

felixvor commented 5 months ago

I'm not one of the authors, but as far as I understand, Donut was pre-trained only on the generated text (the page text in reading order), not on hOCR, which would include bounding boxes. Models like UDOP, LiLT, or LayoutLM come to mind; they do pretty much what you describe during pre-training and get good results with that approach.
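
Not the authors' code, but a minimal sketch of what that text-only supervision could look like: the decoder target is just the reading-order text wrapped in task/end tokens, with no coordinates anywhere. I believe HF's DonutProcessor pairs the image processor with an XLMRobertaTokenizer; the `<s_synthdog>` task token and the sample text below are illustrative.

```python
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.add_tokens(["<s_synthdog>"])  # hypothetical task-start token

# Illustrative page text; in practice this comes from the SynthDoG ground truth.
page_text = "Quarterly report 2021 total revenue 1,234,567 USD"
target = "<s_synthdog>" + page_text + tokenizer.eos_token

labels = tokenizer(target, add_special_tokens=False, return_tensors="pt").input_ids
# The decoder is trained to emit `labels` conditioned only on the image
# encoder's features; nothing in this supervision encodes word positions,
# which is the key difference from LayoutLM-style pre-training.
```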