clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

HuggingFace VisionEncoderDecoderModel performs better (but slower) #118

maxjay commented 1 year ago

I first used the HuggingFace implementation of Donut to train for my specific use case, and it worked well but was very slow (roughly 10x longer per epoch, for example).

So I decided to switch to the official implementation and saw a massive speedup in training. However, the validation loss was actually worse for the same relative epoch time, and this was also apparent when doing inference.

Here are some graphs to illustrate:

[Screenshot: training/validation curves comparing the two implementations]

The green line is the HuggingFace implementation; the others are Donut installed from source from this repo.

As you can see, the training time is much longer for the same relative epochs (it's the same dataset across all runs), but the score is also better for the HuggingFace implementation.

I printed out both models to check their differences; the diff can be found here: https://www.diffchecker.com/11ZHWVyn/
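
A minimal sketch of how such a comparison can be produced, assuming both packages are installed (the output file names are illustrative):

```python
# Dump both architectures to text, then diff the two files.
from donut import DonutModel
from transformers import VisionEncoderDecoderModel

official = DonutModel.from_pretrained("naver-clova-ix/donut-base")
hf = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

with open("official_model.txt", "w") as f:
    f.write(str(official))
with open("hf_model.txt", "w") as f:
    f.write(str(hf))
# Compare with e.g. `diff official_model.txt hf_model.txt` or an online diff tool.
```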

I'm a novice when it comes to machine learning, but I think that these models are essentially the same...

So why is one slower but performing better, and the other faster but worse?

Could it be something to do with this?

Some weights of DonutModel were not initialized from the model checkpoint at naver-clova-ix/donut-base and are newly initialized because the shapes did not match:
- encoder.model.layers.0.blocks.1.attn_mask: found shape torch.Size([3072, 100, 100]) in the checkpoint and torch.Size([768, 100, 100]) in the model instantiated
- encoder.model.layers.1.blocks.1.attn_mask: found shape torch.Size([768, 100, 100]) in the checkpoint and torch.Size([192, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.1.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.3.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.5.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.7.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.9.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.11.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.13.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
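
For what it's worth, this particular warning is usually benign: the Swin encoder's attn_mask entries are deterministic buffers whose shape depends on the input resolution, so they are recomputed whenever the model is instantiated at a size other than the checkpoint's. A minimal sketch of a load that triggers it, with an illustrative (assumed) input_size:

```python
# Sketch: loading the checkpoint at a non-default resolution recreates the
# Swin window-attention masks, producing the shape-mismatch warning above.
# The input_size here is an assumption for illustration only.
from donut import DonutModel

model = DonutModel.from_pretrained(
    "naver-clova-ix/donut-base",
    input_size=[1280, 960],          # differs from the checkpoint's resolution
    ignore_mismatched_sizes=True,    # let the attention masks be re-initialized
)
```
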
NielsRogge commented 1 year ago

Hi,

The HuggingFace implementation should have the same speed. Check out my demo notebooks here: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut.

I fine-tuned Donut on various datasets and didn't see any issues regarding speed.
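
For reference, the loading path in those notebooks goes through the standard transformers classes; a minimal sketch (checkpoint name as used in this thread, training loop omitted):

```python
# Load Donut through the HuggingFace transformers API.
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
```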

khadkechetan commented 1 year ago

@NielsRogge it is slow. I tried training on 3000 images using Donut, and it takes around 40 hours. I am using the same script you provided. How can we improve that?
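
A few generic levers that often help with slow training, sketched below under the assumption that the script is the PyTorch Lightning one from the demo notebooks: mixed precision, a larger batch size, and more dataloader workers. The values are assumptions to be tuned to the available GPU, not a confirmed fix.

```python
# Hedged sketch of standard PyTorch Lightning speed settings.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,    # mixed precision: less memory, usually faster on modern GPUs
    max_epochs=30,
)
# Also consider raising batch_size and num_workers in the DataLoader.
```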