clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

HuggingFace VisionEncoderDecoderModel performs better (but slower) #118

maxjay commented 1 year ago

I first used the HuggingFace implementation of Donut to train for my specific use case, and it worked well but was very slow (roughly 10x longer per epoch, for example).

So I decided to switch to the official implementation and saw a massive speedup in training. However, the validation loss was actually worse for the same relative epoch time, and this was also apparent when doing inference.

Here are some graphs to illustrate:

[Screenshot: training/validation curves comparing the two implementations]

The green line is the HuggingFace implementation; the others are Donut installed from source from this repo.

As you can see, the training time is much longer for the same relative epochs (it's the same dataset across all runs), but the score is also better for the HuggingFace implementation.

I printed out both models to check their differences; the diff can be found here: https://www.diffchecker.com/11ZHWVyn/
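
A minimal sketch of how such a comparison can be produced, assuming both packages are installed (the output file names are illustrative):

```python
# Dump both architectures to text, then diff the two files.
from donut import DonutModel
from transformers import VisionEncoderDecoderModel

official = DonutModel.from_pretrained("naver-clova-ix/donut-base")
hf = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

with open("official_model.txt", "w") as f:
    f.write(str(official))
with open("hf_model.txt", "w") as f:
    f.write(str(hf))
# Compare with e.g. `diff official_model.txt hf_model.txt` or an online diff tool.
```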

I'm a novice when it comes to machine learning, but I think that these models are essentially the same...

So why is one slower but performing better, and the other faster but worse?

Could it be something to do with this?

Some weights of DonutModel were not initialized from the model checkpoint at naver-clova-ix/donut-base and are newly initialized because the shapes did not match:
- encoder.model.layers.0.blocks.1.attn_mask: found shape torch.Size([3072, 100, 100]) in the checkpoint and torch.Size([768, 100, 100]) in the model instantiated
- encoder.model.layers.1.blocks.1.attn_mask: found shape torch.Size([768, 100, 100]) in the checkpoint and torch.Size([192, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.1.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.3.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.5.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.7.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.9.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.11.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
- encoder.model.layers.2.blocks.13.attn_mask: found shape torch.Size([192, 100, 100]) in the checkpoint and torch.Size([48, 100, 100]) in the model instantiated
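
For what it's worth, this particular warning is usually benign: the Swin encoder's attn_mask entries are deterministic buffers whose shape depends on the input resolution, so they are recomputed whenever the model is instantiated at a size other than the checkpoint's. A minimal sketch of a load that triggers it, with an illustrative (assumed) input_size:

```python
# Sketch: loading the checkpoint at a non-default resolution recreates the
# Swin window-attention masks, producing the shape-mismatch warning above.
# The input_size here is an assumption for illustration only.
from donut import DonutModel

model = DonutModel.from_pretrained(
    "naver-clova-ix/donut-base",
    input_size=[1280, 960],          # differs from the checkpoint's resolution
    ignore_mismatched_sizes=True,    # let the attention masks be re-initialized
)
```
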
NielsRogge commented 1 year ago

Hi,

The HuggingFace implementation should have the same speed. Check out my demo notebooks here: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut.

I fine-tuned Donut on various datasets and didn't see any issues regarding speed.
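
For reference, the loading path in those notebooks goes through the standard transformers classes; a minimal sketch (checkpoint name as used in this thread, training loop omitted):

```python
# Load Donut through the HuggingFace transformers API.
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
```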

khadkechetan commented 1 year ago

@NielsRogge it is slow. I tried training on 3000 images using Donut, and it takes around 40 hours. I am using the same script you provided. How can we improve that?
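
A few generic levers that often help with slow training, sketched below under the assumption that the script is the PyTorch Lightning one from the demo notebooks: mixed precision, a larger batch size, and more dataloader workers. The values are assumptions to be tuned to the available GPU, not a confirmed fix.

```python
# Hedged sketch of standard PyTorch Lightning speed settings.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,    # mixed precision: less memory, usually faster on modern GPUs
    max_epochs=30,
)
# Also consider raising batch_size and num_workers in the DataLoader.
```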