clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License
5.75k stars, 466 forks

How to determine the right values for input_size? #119

Open htcml opened 1 year ago

htcml commented 1 year ago

My JPG files are around 5000 x 6000 pixels. I tried input_size=[1280, 1920] because 6000 (image height) > 5000 (image width), but it turns out that input_size=[1280, 960] and [1920, 1280] both outperform [1280, 1920]. Are there any tips for determining the right input_size from the image sizes, or can it only be determined by trial and error?

VictorAtPL commented 2 months ago

@htcml

The smaller the input_size you provide, the more your original images are downscaled (a thumbnail is created from the original image, so some information is lost during resizing).

Note that in the config file, input_size is [height, width], not [width, height]. So it's natural that if your images are 5000 wide and 6000 high, [1920, 1280] outperforms [1280, 1920].
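As a rough illustration of the [height, width] ordering, here is a small sketch that picks the orientation of input_size from the image dimensions. The helper name `orient_input_size` and its default side lengths are hypothetical and not part of Donut; it only encodes the rule of thumb that the longer side of input_size should follow the longer axis of the image.

```python
def orient_input_size(img_width, img_height, long_side=1920, short_side=1280):
    """Return input_size as [height, width] (hypothetical helper, not a Donut API).

    The longer side is assigned to whichever image axis is longer.
    """
    if img_height >= img_width:
        # Portrait (or square) image: height gets the long side.
        return [long_side, short_side]
    # Landscape image: width gets the long side.
    return [short_side, long_side]


# Portrait scans, 5000 wide and 6000 high, as in the question.
print(orient_input_size(5000, 6000))  # [1920, 1280]
```

This only settles the orientation; the absolute values still trade off detail against memory and speed, so some experimentation is likely still needed.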

In the latter scenario ([1280, 1920]), your images (5000x6000, w x h) are first downscaled to 1066x1280 (w x h) and then randomly padded to 1920x1280 (w x h). As you can imagine, there is a lot of padded area (about 1.1M pixels), so the input is not used efficiently.

In the former scenario ([1920, 1280]), your images (5000x6000, w x h) are first downscaled to 1280x1536 (w x h) and then randomly padded to 1280x1920 (w x h). Less area is padded (about 0.5M pixels), so the original image takes up a larger share of the final image fed to the model than in the previous scenario.
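The arithmetic for both scenarios can be reproduced with a short sketch. It assumes the preprocessing is: scale the image so its shorter edge equals min(input_size), shrink it further if needed so it fits inside the target, then pad up to the target. That assumption matches the numbers quoted here, but the function `resize_and_pad` is hypothetical, uses floor rounding (the real libraries may differ by a pixel), and ignores rotation and any other preprocessing options.

```python
def resize_and_pad(img_w, img_h, input_size):
    """Estimate the resized image size and padded area (illustrative sketch only).

    input_size is [height, width], as in the config file.
    Returns (resized_width, resized_height, padded_pixels).
    """
    target_h, target_w = input_size

    # Step 1 (assumed): scale so the shorter edge equals min(input_size).
    short = min(input_size)
    min_side = min(img_w, img_h)
    w = img_w * short // min_side
    h = img_h * short // min_side

    # Step 2 (assumed): shrink to fit inside the target, keeping aspect ratio.
    if w > target_w or h > target_h:
        if target_w * h <= target_h * w:
            # Width is the limiting dimension.
            w, h = target_w, h * target_w // w
        else:
            # Height is the limiting dimension.
            w, h = w * target_h // h, target_h

    # Step 3: everything in the target canvas not covered by the image is padding.
    padded = target_w * target_h - w * h
    return w, h, padded


for size in ([1280, 1920], [1920, 1280], [1280, 960]):
    w, h, padded = resize_and_pad(5000, 6000, size)
    used = w * h / (size[0] * size[1])
    print(f"input_size={size}: image {w}x{h} (w x h), "
          f"padded {padded} px, {used:.0%} of the canvas used")
```

Under these assumptions, [1280, 1920] leaves roughly 1.09M padded pixels and [1920, 1280] about 0.49M. [1280, 960] from the original question pads even less, since its aspect ratio is closer to that of the images, though at a lower overall resolution.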