Open lauraminkova opened 1 year ago
Hi Laura, For pre-training with SynthDog, is the pretrained model able to parse plain text? Or does it spew out gibberish? If it does, can you share a single instance of your training data just before calling model.train()? @lauraminkova
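For example, printing something like this right before training would help (a minimal sketch, assuming a standard PyTorch `DataLoader` and a Hugging Face `DonutProcessor`; the variable names are illustrative, not from your notebook):

```python
from transformers import DonutProcessor

# Illustrative setup: `train_dataloader` is assumed to already exist and to
# yield dicts with "pixel_values" and "labels", as in the common Donut
# fine-tuning notebooks.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

batch = next(iter(train_dataloader))
pixel_values, labels = batch["pixel_values"], batch["labels"]

# Replace the ignore index (-100) before decoding so the tokenizer doesn't choke.
labels = labels.clone()
labels[labels == -100] = processor.tokenizer.pad_token_id

print("image tensor shape:", pixel_values[0].shape)
print("target sequence:", processor.tokenizer.decode(labels[0], skip_special_tokens=False))
```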
Hi, thanks so much for your reply @rm-asif-amin! Since posting my original issue, the model can now predict text in French and is no longer producing gibberish :) However, the training loss is extremely low after ~60 epochs, whereas the validation metrics are not doing so great: the minimum edit distance is still around 0.75. Is this typical when pre-training with SynthDoG?
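(For clarity, by edit distance I mean roughly the normalized edit distance used in the common Donut fine-tuning notebooks, something along these lines:)

```python
from nltk import edit_distance

def normalized_edit_distance(prediction: str, reference: str) -> float:
    # 0.0 = exact match, 1.0 = nothing matches.
    return edit_distance(prediction, reference) / max(len(prediction), len(reference))

# A score around 0.75 means roughly three quarters of the characters
# would still have to change to recover the target text.
print(normalized_edit_distance("bonjour tout le monde", "bonjour le monde"))
```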
Hi @lauraminkova! Glad to know that you're closing in on solving it! I used a different notebook as a reference, so I didn't track the edit distance metric during training. Can you share your raw log loss?
For my case, training on 120k examples for 100k steps (batch size 8) was enough. Training loss was very close to zero and validation loss (log loss) was about 0.3. I went on to about 200k steps just to see how it converges. Sharing my training and validation metrics with you:
Training metrics: [screenshot in original comment]
Validation metrics: [screenshot in original comment]
With this pre-training, I was able to successfully execute the downstream Information Extraction task on Bengali Documents.
This notebook is a walkthrough of my workflow if you want to try an alternate approach- https://colab.research.google.com/drive/1V_vP4p3W874QTSN-EYHOg-t4XGYR8TlP?usp=sharing
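As a rough sketch of what the pretraining targets typically look like with SynthDoG data: the generator writes a `metadata.jsonl` whose `ground_truth` field holds the text to read, and that gets wrapped into a decoder target sequence. (Field names follow the usual synthdog output but can vary between versions/configs; the task token below is a placeholder, match it to whatever special tokens your tokenizer actually has.)

```python
import json

# One assumed SynthDoG-style metadata line; exact field names can vary
# between synthdog versions/configs.
line = '{"file_name": "image_0.jpg", "ground_truth": "{\\"gt_parse\\": {\\"text_sequence\\": \\"some generated text\\"}}"}'

record = json.loads(line)
gt = json.loads(record["ground_truth"])
text = gt["gt_parse"]["text_sequence"]

# Donut is trained to emit a task start token followed by the target text.
# "<s_synthdog>" here is a placeholder for the pretraining task token.
target_sequence = "<s_synthdog>" + text + "</s>"
print(record["file_name"], "->", target_sequence)
```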
Hey there! I'm looking to fine-tune Donut on a few different languages and would love to look at your notebook as a starting point! I've requested access via Google Drive. Thanks in advance!
Hi @eschaffn, you should now be able to access it. Please note that the notebook is primarily for debugging; it closely resembles my actual pretraining workflow, with some minor changes.
Thank you!
Is this fine-tuning the base model to perform just simple OCR, or is it possible to do layout analysis too, such as labeling equations, titles, headers, footers, tables, etc.? Also, did you create the data using SynthDoG?
This is retraining/fine-tuning the base model to adapt it to a new language (reading text). For the higher-level tasks you mention, it needs to be trained/fine-tuned again on the specific task. Yes, SynthDoG was used, but this notebook doesn't show the SynthDoG data.
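To make the distinction concrete: Donut selects the task at inference time through the decoder prompt, so after the language adaptation you still fine-tune with a new task token and then generate with that prompt. A rough sketch with the `transformers` API (the checkpoint name and task token below are placeholders, not my actual ones):

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Placeholder checkpoint: replace with your own language-adapted Donut model.
processor = DonutProcessor.from_pretrained("your-org/donut-base-bengali")
model = VisionEncoderDecoderModel.from_pretrained("your-org/donut-base-bengali")

image = Image.open("document.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The decoder prompt selects the task: plain reading vs. a downstream task
# such as information extraction ("<s_my_ie_task>" is a placeholder token).
decoder_input_ids = processor.tokenizer(
    "<s_my_ie_task>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)
print(processor.tokenizer.decode(outputs[0], skip_special_tokens=False))
```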
I see, thank you! I think my last question is: is SynthDoG able to generate data for multilingual layout extraction too? Something like FUNSD or XFUND; I'm not too concerned with receipts, so CORD is not useful for me. But I can't find any layout datasets for languages like Russian with annotated tables, titles, headers, footers, etc.
Would a Russian Donut model fine-tuned on FUNSD be able to generalize to Russian document layout analysis?
@rm-asif-amin @lauraminkova Could either of you please share @rm-asif-amin's notebook so I can follow his configuration? I have tried many configurations for fine-tuning on Vietnamese but have had no good results so far; the edit distance is still high. (I created synthetic data to supplement the real data, about 40k samples, and trained for around 30 epochs, but the normalized edit distance is still above 0.14; for comparison, even Tesseract achieves 0.08.)
@doduythao I have shared my notebook with you, please check.
Hi, I am trying to fine-tune Donut for French documents. I wanted to ask how it went for you? Also, could you share the tokenizer you used for French? And is it possible to fine-tune it for both English and French using one tokenizer? Thank you!
@rm-asif-amin can you share your notebook with me? Thanks a lot.
Hi there!
First of all, thank you so much for all of your work and the time put into answering everyone's questions in the Issues section!
I've been trying to finetune Donut for French visual question answering, but have encountered lots of issues.
My initial thought process:
In order to account for the change in language, I changed synthdog's config_fr.yaml to use a French font and a French corpus (2.4M). I also changed the tokenizer for both fine-tuning processes to a French one (and checked that it works; it does!). I even read in #11 that maybe I should change the decoder to one that can better handle French, so I did that as well.
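(For context, here is a minimal sketch of the kind of tokenizer/decoder swap I mean; the tokenizer and task token below are placeholders rather than my exact setup. The main point is that the tokenizer, the decoder embedding size, and the special tokens all have to stay in sync, otherwise the decoder generates into the wrong token space.)

```python
from transformers import AutoTokenizer, DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Placeholder: any French-capable sentencepiece tokenizer could go here.
french_tokenizer = AutoTokenizer.from_pretrained("camembert-base")
french_tokenizer.add_special_tokens({"additional_special_tokens": ["<s_synthdog>"]})
processor.tokenizer = french_tokenizer

# The decoder's embedding matrix must match the new vocabulary size,
# otherwise training/decoding happens in the wrong token space (gibberish).
model.decoder.resize_token_embeddings(len(french_tokenizer))
model.config.pad_token_id = french_tokenizer.pad_token_id
model.config.decoder_start_token_id = french_tokenizer.convert_tokens_to_ids("<s_synthdog>")
```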
Despite these changes, I still get rather poor metrics with both pre-training with SynthDoG_fr (the lowest val_edit_distance is ~0.74 after 30 epochs) and fine-tuning on French documents for VQA (though this is unsurprising given the pre-training results). The predictions are visibly bad as well, usually gibberish.
Am I missing anything? Any help would be greatly appreciated!