NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

TrOCR image resizing #124

Open siljuovix opened 2 years ago

siljuovix commented 2 years ago

I have been going through the Fine-tune TrOCR on the IAM Handwriting Database tutorial and was wondering how the preprocessing works. The paper explains that the text images are first resized to 384 × 384 and then split into a sequence of 16 × 16 patches, and the code reflects that. But the resizing makes the line images look very distorted. Moreover, can someone explain why the image inside the red rectangle below is not square?

[image: patches]

I wonder if that figure is just a (somewhat confusing) illustration, or if I am missing something about the way the model was trained. How can the model learn a downstream task from such distorted input on the IAM dataset?
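
For reference, here is a minimal sketch (using the `microsoft/trocr-base-handwritten` checkpoint from the tutorial; the dummy image dimensions are just illustrative) confirming that the processor squashes a wide line image into a 384 × 384 square:

```python
from PIL import Image
from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")

# dummy wide "line" image, roughly the shape of an IAM line crop
image = Image.new("RGB", (1200, 80), "white")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)   # torch.Size([1, 3, 384, 384]) -> aspect ratio is gone

# the ViT encoder then splits the square into (384 // 16) ** 2 patches
print((384 // 16) ** 2)     # 576
```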

And finally, has anyone managed to obtain a CER below 5% on the IAM dataset with TrOCR?

Mohammed20201991 commented 1 year ago

> can someone explain why the image inside the red rectangle is not square?

I think the answer is that it depends on the line segment, as mentioned in the original TrOCR paper. You can also resize your image yourself before it is fed to TrOCR.
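
A minimal sketch of what that pre-resize could look like (plain PIL; the 384 target size, centring, and white padding are my assumptions, not what TrOCR's processor does internally):

```python
from PIL import Image

def resize_keep_aspect(image: Image.Image, size: int = 384) -> Image.Image:
    """Scale the longer side to `size`, then pad to a square canvas."""
    w, h = image.size
    scale = size / max(w, h)
    resized = image.resize((max(1, round(w * scale)), max(1, round(h * scale))))
    canvas = Image.new("RGB", (size, size), "white")  # assumed background colour
    canvas.paste(resized, ((size - resized.width) // 2, (size - resized.height) // 2))
    return canvas
```

The trade-off is that a very wide line ends up as a thin strip of text in a mostly blank square, so each glyph covers only a few patches.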

bely66 commented 2 months ago

Hi @NielsRogge

Any direction?

NielsRogge commented 2 months ago

Hi, yes, TrOCR effectively distorts the aspect ratio of the image, since it gets resized to a square 224 × 224 or 384 × 384 image, just like in the original Vision Transformer.

More recent models keep the aspect ratio of the image, which results in better performance. Examples include Donut and Pix2Struct.
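
For example, Pix2Struct's processor keeps the aspect ratio and extracts a variable number of patches up to a budget. A minimal sketch (the checkpoint, dummy image size, and `max_patches` value are just illustrative):

```python
from PIL import Image
from transformers import Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-base")

image = Image.new("RGB", (1200, 80), "white")  # dummy wide line image
inputs = processor(images=image, max_patches=512, return_tensors="pt")

# patches are sampled at the image's own aspect ratio rather than from a
# fixed square resize: shape is (batch, max_patches, 2 + 16 * 16 * 3)
print(inputs.flattened_patches.shape)  # torch.Size([1, 512, 770])
```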

So I assume the figure in the TrOCR paper doesn't reflect what the code actually does.