clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Donut Return Output even With Blank Image #272

Open wdprsto opened 10 months ago

wdprsto commented 10 months ago

Hello sir, I was trying to understand how Donut works. Since Donut is an OCR-free multimodal transformer, which part of the model architecture can lead to this false inference: the encoder or the decoder? Is it right that it is caused by the autoregressive decoder? I would be very grateful if someone could explain this to me, since I still can't understand how Donut can predict output even for a blank image, and why it sometimes also generates wrong text taken from the training data.
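
To see this behaviour directly, here is a minimal sketch (assuming the Hugging Face port of Donut via `DonutProcessor` / `VisionEncoderDecoderModel` and a public fine-tuned checkpoint, not this repo's training code) that feeds a blank image through the model. Generation is driven by the autoregressive decoder: conditioned on the task prompt, it keeps predicting the most likely next token given whatever weak features the Swin encoder produced, so it emits some sequence even when the page is empty.

```python
# Minimal sketch: run a fine-tuned Donut checkpoint on a blank image.
# Assumes the Hugging Face transformers port of Donut; the checkpoint name
# below is one of the public CORD checkpoints and is used only for illustration.
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

model_name = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(model_name)
model = VisionEncoderDecoderModel.from_pretrained(model_name)

# A completely blank (white) page.
blank = Image.new("RGB", (960, 1280), color="white")
pixel_values = processor(blank, return_tensors="pt").pixel_values

# The decoder is prompted with the task start token and then generates
# autoregressively, one token at a time, until EOS or max length.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)

# The output is typically a low-confidence sequence memorised from the
# training distribution rather than anything read off the blank page.
print(processor.batch_decode(outputs)[0])
```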

balajiChundi commented 7 months ago

I have faced a similar problem and identified that the training set had some ambiguous entries. Possible causes:

a) Your training set may contain outliers, i.e. text that is not present in the image but is present in the ground truth (output) of a training sample. Clean up the training data (a rough check is sketched below).
b) Training was run for too few epochs. Continue training from the last checkpoint.
c) The max_position_embeddings parameter is set too large (although I don't think this is the case here).
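
For point a), a rough sanity-check sketch; it assumes the `metadata.jsonl` layout used for Donut training data (one JSON object per line with `"file_name"` and a `"ground_truth"` string containing `"gt_parse"`), the dataset path is hypothetical, and the 500-character threshold is arbitrary — adapt both to your own data.

```python
# Rough sanity check over a Donut-style dataset: flag missing images,
# empty gt_parse objects, and suspiciously long top-level values
# (text that is unlikely to fit on the page is a common source of
# hallucinated outputs after training).
import json
from pathlib import Path

dataset_dir = Path("dataset/train")  # hypothetical path
problems = []

with open(dataset_dir / "metadata.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        image_path = dataset_dir / record["file_name"]
        if not image_path.exists():
            problems.append((line_no, "image file missing"))
            continue
        gt = json.loads(record["ground_truth"])
        parse = gt.get("gt_parse", {})
        if not parse:
            problems.append((line_no, "empty gt_parse"))
        # Only top-level string values are checked here; nested structures
        # would need a recursive walk.
        for key, value in parse.items():
            if isinstance(value, str) and len(value) > 500:
                problems.append((line_no, f"very long value for '{key}'"))

for line_no, issue in problems:
    print(f"line {line_no}: {issue}")
```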

wdprsto commented 7 months ago

Alright, I'll try to make sure the data is labeled properly. On the other hand, do you have a recommended number of training epochs based on your experience? That would be helpful. Thanks!

balajiChundi commented 7 months ago

I have around 15k images and trained for 10 epochs, which resulted in repetitions in some cases: around 10-15% of the images I validated on have this problem, and the rest are mostly fine. So I have started another 10 epochs, but I suspect only the decoder is under-trained; freezing the encoder's weights and training only the decoder might also help in my case. I'll update once it finishes training.
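
In case it helps others, a hedged sketch of the freezing idea. It assumes the `DonutModel` class from this repo (`donut/model.py`), which exposes `.encoder` and `.decoder` submodules; the checkpoint path is hypothetical, and this should run before the optimizer is created in your training script.

```python
# Freeze the Swin encoder so continued training only updates the decoder.
from donut import DonutModel

# Hypothetical path to a previously trained checkpoint.
model = DonutModel.from_pretrained("result/train_cord/last")

for param in model.encoder.parameters():
    param.requires_grad = False  # encoder weights stay fixed

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```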

balajiChundi commented 7 months ago

I figured out that I had made a mistake in preparing the training data, so even after training for 20 epochs there were repetitions. I fixed the error in data preparation and trained for 5 epochs, and its performance now exceeds that of the previous models. The mistake: when I created gt_parse, the lines in the ground truth were not in reading order.
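
To make the reading-order point concrete, here is a purely illustrative sketch of sorting line annotations top-to-bottom, then left-to-right, before building gt_parse; the input structure with `"text"` and `"box"` fields and the row tolerance are assumptions, not the repo's format.

```python
# Illustrative only: put ground-truth lines into reading order so the target
# sequence matches how the page is read, before serializing it into gt_parse.
def sort_reading_order(lines, row_tolerance=10):
    """Sort line annotations by vertical position, then horizontal.

    `lines` is assumed to look like:
        [{"text": "Total", "box": [x1, y1, x2, y2]}, ...]
    `row_tolerance` groups lines whose tops differ by only a few pixels
    into the same visual row.
    """
    return sorted(lines, key=lambda l: (l["box"][1] // row_tolerance, l["box"][0]))

lines = [
    {"text": "9.000", "box": [400, 52, 480, 70]},
    {"text": "Nasi Goreng", "box": [40, 50, 220, 70]},
    {"text": "Total", "box": [40, 90, 120, 110]},
]
ordered = sort_reading_order(lines)
print([l["text"] for l in ordered])  # ['Nasi Goreng', '9.000', 'Total']
```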

wdprsto commented 7 months ago

I also found that the order of the ground truth affects model performance. Maybe it's related to how the Swin encoder and mBART decoder work. Anyway, I was wondering: what batch size did you use? Does it affect performance?