clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Difficulties finetuning for another language #236

Open lauraminkova opened 11 months ago

lauraminkova commented 11 months ago

Hi there!

First of all, thank you so much for all of your work and the time put into answering everyone's questions in the Issues section!

I've been trying to finetune Donut for French visual question answering, but I've run into a number of issues.

My initial thought process:

  1. Create French SynthDoG data (No problem here)
  2. Finetune donut-base using SynthDoG_fr data (using this notebook as a basis https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb)
  3. Finetune donut-SynthDog_fr model on French documents for visual question answering (using this notebook as a basis https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/DocVQA/Fine_tune_Donut_on_DocVQA.ipynb)

To account for the change in language, I changed config_fr.yaml for SynthDoG to use a French font and a French corpus (2.4M). I also changed the tokenizer for both finetuning processes to a French one (and checked that it works - it does!). I even read in #11 that maybe I should change the decoder to one that better handles French, so I did that as well.
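For reference, a minimal sketch of that kind of tokenizer/decoder swap with the Hugging Face API; the French tokenizer checkpoint below ("moussaKam/barthez", a French BART) is an illustrative assumption, not necessarily the one used in this thread:

```python
# Minimal sketch: swap a French tokenizer into the Donut processor and
# resize the decoder's embeddings to match the new vocabulary.
from transformers import AutoTokenizer, DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Illustrative choice of French tokenizer (BARThez); any French BPE works here.
fr_tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez")
processor.tokenizer = fr_tokenizer

# Resize the decoder embedding table so its rows line up with the new token ids.
model.decoder.resize_token_embeddings(len(fr_tokenizer))
model.config.decoder.vocab_size = len(fr_tokenizer)
model.config.pad_token_id = fr_tokenizer.pad_token_id
model.config.decoder_start_token_id = fr_tokenizer.bos_token_id
```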

Despite these changes, I still get rather poor metrics, both when pre-training with SynthDoG_fr (the lowest val_edit_distance is ~0.74 after 30 epochs) and when finetuning on French documents for VQA (though this is unsurprising given the pre-training results). The outputs are visibly bad as well, usually gibberish.

Am I missing anything? Any help would be greatly appreciated!
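For context, the val_edit_distance above is the normalized Levenshtein distance computed in the referenced notebook; a minimal sketch, assuming nltk:

```python
# Minimal sketch of the normalized edit-distance metric from the referenced
# notebook: Levenshtein distance scaled by the longer string's length.
from nltk import edit_distance

def normalized_edit_distance(pred: str, answer: str) -> float:
    # 0.0 = exact match, 1.0 = completely different (assumes non-empty strings)
    return edit_distance(pred, answer) / max(len(pred), len(answer))

print(normalized_edit_distance("bonjour", "bonjour le monde"))  # 0.5625
```

A score of ~0.74 therefore means roughly three quarters of the output characters would need editing to match the target.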

rm-asif-amin commented 11 months ago

Hi Laura, for pre-training with SynthDoG, is the pretrained model able to parse plain text, or does it spew out gibberish? If it's the latter, can you share a single instance of your training data just before calling model.train()? @lauraminkova

lauraminkova commented 11 months ago

Hi, thanks so much for your reply @rm-asif-amin ! Since posting my original issue, the model can now predict text in French and is not just predicting gibberish :) However, the training loss is extremely low after ~60 epochs, whereas the validation metrics are not doing so great - the minimum validation edit distance is still around 0.75. Is this typical for finetuning SynthDoG?
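A near-zero training loss paired with a ~0.75 validation edit distance usually points to overfitting. One common mitigation, sketched below under the assumption of the PyTorch Lightning setup from the notebooks above (and a module that logs val_edit_distance), is to checkpoint and stop on the validation metric rather than the training loss:

```python
# Sketch: monitor the validation metric, keep the best checkpoint, and stop
# early once it stops improving. Assumes a LightningModule that logs
# "val_edit_distance" in its validation step, as in the referenced notebook.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    ModelCheckpoint(monitor="val_edit_distance", mode="min", save_top_k=1),
    EarlyStopping(monitor="val_edit_distance", mode="min", patience=5),
]
trainer = Trainer(max_epochs=60, callbacks=callbacks)
```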

rm-asif-amin commented 10 months ago

Hi @lauraminkova! Glad to know that you're closing in on solving it! I used a different notebook as a reference, so I didn't use the edit distance metric during training. Can you share your raw log-loss?

In my case, training on 120k examples for 100k steps (batch size 8) was enough. Training loss was very close to zero and validation loss (log loss) was about 0.3. I went on to about 200k steps just to see how it converges. Sharing my training and validation metrics with you below.

Training metrics: [screenshot IMG_0301]

Validation metrics: [screenshot IMG_0302]
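For a rough sense of scale, those settings work out to about seven passes over the data:

```python
# Back-of-envelope conversion of the run above: 100k steps at batch size 8
# over 120k examples.
examples, batch_size, steps = 120_000, 8, 100_000
epochs = steps * batch_size / examples
print(f"~{epochs:.1f} epochs")  # ~6.7 epochs
```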

With this pre-training, I was able to successfully run the downstream information extraction task on Bengali documents.

This notebook is a walkthrough of my workflow if you want to try an alternate approach: https://colab.research.google.com/drive/1V_vP4p3W874QTSN-EYHOg-t4XGYR8TlP?usp=sharing
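For anyone following along, this is roughly what downstream inference with such a finetuned checkpoint looks like in the Hugging Face API; the checkpoint path and task token below are hypothetical placeholders, not values from this thread:

```python
# Sketch of Donut inference for a downstream task (e.g. information
# extraction) with the HF API. Checkpoint path and task token are placeholders.
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("path/to/finetuned-checkpoint")
model = VisionEncoderDecoderModel.from_pretrained("path/to/finetuned-checkpoint")

image = Image.open("document.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

task_prompt = "<s_my_task>"  # hypothetical task start token
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
    eos_token_id=processor.tokenizer.eos_token_id,
    pad_token_id=processor.tokenizer.pad_token_id,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1)  # drop the task start token
print(processor.token2json(sequence))
```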

eschaffn commented 10 months ago

Hey there! I'm looking to fine-tune Donut on a few different languages and would love to look at your notebook as a starting point! I've requested access via Google Drive. Thanks in advance!

rm-asif-amin commented 10 months ago

Hi @eschaffn, you should now be able to access it. Please note that the notebook is primarily for debugging; it closely resembles my actual pretraining workflow, with some minor changes.

eschaffn commented 10 months ago

Thank you!

Is this finetuning the base model to perform just simple OCR, or is it possible to do layout analysis too, such as labeling equations, titles, headers, footers, tables, etc.? Also, did you create the data using SynthDoG?

rm-asif-amin commented 10 months ago

This is retraining/finetuning the base model to adapt it to a new language (i.e., to read its text). For the higher-level tasks you mention, it needs to be trained/finetuned again on the specific task. Yes, SynthDoG was used, but this notebook doesn't show the SynthDoG data.
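For context, this "read text" adaptation step trains against a plain text-reading target. A sketch of what one SynthDoG-style ground-truth record looks like, assuming the metadata.jsonl convention from this repo:

```python
# Sketch of one SynthDoG-style ground-truth record for the "read text"
# pretraining task; the ground_truth field is a JSON string wrapping
# gt_parse.text_sequence (the text rendered on the synthetic image).
import json

record = {
    "file_name": "image_0.jpg",
    "ground_truth": json.dumps(
        {"gt_parse": {"text_sequence": "Le rapport annuel présente ..."}},
        ensure_ascii=False,
    ),
}
print(json.dumps(record, ensure_ascii=False))  # one line of metadata.jsonl
```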

eschaffn commented 10 months ago

I see, thank you! I think my last question is: is SynthDoG able to generate data for multilingual layout extraction too, something like FUNSD or XFUND? I'm not too concerned with receipts, so CORD is not useful for me. But I can't find any layout datasets for languages like Russian with annotated tables, titles, headers, footers, etc.

Would a Russian Donut model fine-tuned on FUNSD be able to generalize to Russian document layout analysis?