clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Difficulties finetuning for another language #236

Open · lauraminkova opened 1 year ago

lauraminkova commented 1 year ago

Hi there!

First of all, thank you so much for all of your work and the time put into answering everyone's questions in the Issues section!

I've been trying to finetune Donut for French visual question answering, but have encountered lots of issues.

My initial thought process:

  1. Create French SynthDoG data (No problem here)
  2. Finetune donut-base using SynthDoG_fr data (using this notebook as a basis: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb)
  3. Finetune donut-SynthDog_fr model on French documents for visual question answering (using this notebook as a basis https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/DocVQA/Fine_tune_Donut_on_DocVQA.ipynb)

In order to account for the change in language, I changed the config_fr.yaml for SynthDoG to use a French font and a French corpus (2.4M). I also changed the tokenizer for both finetuning processes to a French one (and checked that it works - it does!). I even read in #11 that I should perhaps change the decoder to one that can better handle French, so I did that as well.
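For concreteness, the tokenizer swap looks roughly like this - a minimal sketch; the BARThez checkpoint is just an example of a French-capable tokenizer, not necessarily the exact one I used:

```python
from transformers import AutoTokenizer, DonutProcessor, VisionEncoderDecoderModel

# Start from the official donut-base checkpoint.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Swap in a French-capable tokenizer (example checkpoint, assumption).
fr_tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez")
processor.tokenizer = fr_tokenizer

# Resize the decoder embeddings to the new vocabulary; without this the
# decoder indexes embeddings that were never trained and emits gibberish.
model.decoder.resize_token_embeddings(len(fr_tokenizer))
model.config.decoder.vocab_size = len(fr_tokenizer)
model.config.pad_token_id = fr_tokenizer.pad_token_id
```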

Despite these changes, I still get rather poor metrics both when pre-training with SynthDoG_fr (the lowest val_edit_distance is ~0.74 after 30 epochs) and when finetuning on French documents for VQA (though this is unsurprising given the pre-training results). The predictions are visibly bad as well, usually gibberish.
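(For reference, the val_edit_distance I'm reporting is the normalized edit distance computed in that notebook, roughly:)

```python
# Normalized edit distance, as in the referenced notebook:
# 0 = exact match, 1 = completely different (lower is better).
from nltk import edit_distance

def normalized_edit_distance(pred: str, answer: str) -> float:
    return edit_distance(pred, answer) / max(len(pred), len(answer))
```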

Am I missing anything? Any help would be greatly appreciated!

rm-asif-amin commented 1 year ago

Hi Laura, for pre-training with SynthDog, is the pretrained model able to parse plain text, or does it spew out gibberish? If it's the latter, can you share a single instance of your training data just before calling model.train()? @lauraminkova
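Something like this would do - a sketch, assuming the NielsRogge-style DonutDataset that yields (pixel_values, labels, target_sequence):

```python
# Inspect one training example right before training.
pixel_values, labels, target_sequence = train_dataset[0]
print(pixel_values.shape)   # image tensor fed to the Swin encoder
print(target_sequence)      # the raw target string

# Recover the tokenized target: -100 marks positions ignored by the loss.
labels = labels.clone()
labels[labels == -100] = processor.tokenizer.pad_token_id
print(processor.tokenizer.decode(labels))
```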

lauraminkova commented 1 year ago


Hi, thanks so much for your reply @rm-asif-amin! Since posting my original issue, the model can now predict text in French and is no longer producing just gibberish :) However, while the training loss is extremely low after ~60 epochs, the validation metrics are not doing so great - the minimum edit distance is still around 0.75. Is this typical when finetuning on SynthDoG data?

rm-asif-amin commented 1 year ago

Hi @lauraminkova! Glad to know that you're closing in on solving it! I used a different notebook as a reference, so I didn't use the edit distance metric during training. Can you share your raw log-loss?
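(If you're on the Lightning notebook, logging the raw loss looks something like this - a sketch, assuming the notebook's (pixel_values, labels, target_sequence) batch layout and a self.model field holding the VisionEncoderDecoderModel:)

```python
import pytorch_lightning as pl

class DonutModule(pl.LightningModule):
    # ... __init__, training_step, configure_optimizers as in the notebook ...

    def validation_step(self, batch, batch_idx):
        pixel_values, labels, _ = batch
        outputs = self.model(pixel_values=pixel_values, labels=labels)
        # outputs.loss is the token-level cross-entropy, i.e. the raw log-loss.
        self.log("val_loss", outputs.loss, prog_bar=True)
        return outputs.loss
```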

For my case, training on 120k examples for 100k steps (batch size 8) was enough - roughly 6-7 passes over the data. Training loss was very close to zero and validation loss (log-loss) was about 0.3. I went on to about 200k steps just to see how it converges. Sharing my training and validation metrics with you:

Training metrics: [screenshot IMG_0301]

Validation metrics: [screenshot IMG_0302]

With this pre-training, I was able to successfully execute the downstream Information Extraction task on Bengali Documents.

This notebook is a walkthrough of my workflow if you want to try an alternate approach: https://colab.research.google.com/drive/1V_vP4p3W874QTSN-EYHOg-t4XGYR8TlP?usp=sharing

eschaffn commented 1 year ago


Hey there! I'm looking to fine-tune Donut on a few different languages and would love to look at your notebook as a starting point! I've requested access via Google Drive. Thanks in advance!

rm-asif-amin commented 1 year ago

Hi @eschaffn, you should now be able to access it. Please note that the notebook is primarily for debugging. It closely resembles my actual workflow for the pretraining process, with some minor changes.

eschaffn commented 1 year ago


Thank you!

Is this finetuning the base model to perform just simple OCR, or is it possible to do layout analysis too, such as labeling equations, titles, headers, footers, tables, etc.? Also, did you create the data using SynthDoG?

rm-asif-amin commented 1 year ago


This is retraining/finetuning the base model to adapt it to a new language (i.e., to read text). For the higher-level tasks you mention, it needs to be trained/finetuned again on the specific task. Yes, SynthDoG was used, but this notebook doesn't show the SynthDoG data.
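(Donut frames every task as prompted generation, so per-task finetuning mostly comes down to training with a new task start token - a sketch below, with hypothetical token names:)

```python
# Each Donut task is selected by a task start token; after language
# adaptation you add and train a new prompt for the downstream task.
task_prompt = "<s_synthdog>"      # pseudo-OCR "read the text" task
# task_prompt = "<s_my-ie-task>"  # hypothetical downstream IE task token

decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,                 # preprocessed document image
    decoder_input_ids=decoder_input_ids,
    max_length=768,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
)
print(processor.batch_decode(outputs)[0])
```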

eschaffn commented 1 year ago


I see, thank you! I think my last question is: is SynthDoG able to generate data for multilingual layout extraction too? Something like FUNSD or XFUND - I'm not too concerned with receipts, so CORD is not useful for me. But I can't find any layout datasets for languages like Russian with annotated tables, titles, headers, footers, etc.

Would a Russian Donut model fine-tuned on FUNSD be able to generalize to Russian document layout analysis?

doduythao commented 3 months ago

@rm-asif-amin @lauraminkova Could either of you please share @rm-asif-amin's notebook so I can follow the configuration? I've tried many configurations for finetuning on Vietnamese but have had no good results so far - the edit distance is still high. (I created synthetic data to supplement the real data, about 40k samples, and trained for around 30 epochs, but the normalized edit distance is still above 0.14; for comparison, even Tesseract achieves about 0.08.)

rm-asif-amin commented 3 months ago

@doduythao I have shared my notebook with you, please check.

ZiedChekir commented 3 months ago

Hi, I am trying to fine-tune Donut for French documents. I wanted to ask how it went for you? Also, could you share the tokenizer you used for French? And is it possible to fine-tune it for both English and French using one tokenizer? Thank you!

ThaiTOm commented 1 month ago

@rm-asif-amin could you share your notebook with me as well? Thanks a lot.