clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Issue while training with custom decoder and tokenizer #215

Open akashlp27 opened 1 year ago

akashlp27 commented 1 year ago

Hey, I was trying to train a model for parsing Arabic data. As mentioned by @VictorAtPL, I trained an MBart tokenizer on an Arabic corpus and also trained the MBart decoder on the same corpus, but training fails at line 128 of the util file:

https://github.com/clovaai/donut/blob/15534e5cd33b524bf323752c24081a15680c80a7/donut/util.py#L128

where the prompt tokens (up to prompt_end_token_id) and the pad_token_id positions are replaced by -100 (the ignore id). With my custom tokenizer and decoder, however, all of the generated decoder ids get replaced by -100, not just the prompt and pad tokens. Because of this, the loss is NaN.
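For reference, the masking logic around that line is roughly the following (a paraphrase for illustration, not a verbatim copy of util.py; `make_labels` and the token ids are made up):

```python
import torch

IGNORE_ID = -100  # nn.CrossEntropyLoss(ignore_index=-100) skips these positions

def make_labels(input_ids: torch.Tensor, pad_token_id: int, prompt_end_token_id: int) -> torch.Tensor:
    # Paraphrase of the masking around donut/util.py#L128 (single sequence).
    labels = input_ids.clone()
    # Pad tokens should not contribute to the loss.
    labels[labels == pad_token_id] = IGNORE_ID
    # Everything up to and including the prompt end token is the prompt,
    # so it is masked as well.
    labels[: torch.nonzero(labels == prompt_end_token_id).sum() + 1] = IGNORE_ID
    return labels

# Made-up ids for illustration: <s>=0, prompt_end=57521, </s>=2, <pad>=1.
input_ids = torch.tensor([0, 57521, 6, 9, 412, 88, 2, 1, 1])
labels = make_labels(input_ids, pad_token_id=1, prompt_end_token_id=57521)
# If this prints 0, every position is ignored and the loss becomes NaN.
print((labels != IGNORE_ID).sum().item())  # -> 5 in this healthy example
```

One thing worth checking with a custom tokenizer: `torch.nonzero(...).sum()` adds up all matching indices, so if `prompt_end_token_id` matches more than one position (for example because the prompt end token was never added to the vocabulary and collapsed to `unk_token_id`), the slice can cover the whole sequence and every label becomes -100.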

Any suggestions or inputs would help a lot, @gwkrsrch, @Vadkoz, @NielsRogge. Thanks and regards.

VictorAtPL commented 1 year ago

Hey @akashlp27,

would you mind sharing how you trained the MBart tokenizer and MBart decoder on the Arabic corpus? I have wanted to do this for Polish for some time but couldn't figure out how.

If I get it working for Polish, maybe I can also help resolve the NaN issue.

akashlp27 commented 1 year ago

Hey @VictorAtPL, I followed the same procedure as mentioned in https://github.com/clovaai/donut/issues/11#issuecomment-1437140288, but only for a single-language corpus, in this case Arabic.
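In outline, the steps look roughly like the sketch below (paths, vocab size, and the pretraining step are illustrative, not the exact settings from that comment; as far as I can tell, XLMRobertaTokenizer and MBartForCausalLM are the classes Donut itself uses for the tokenizer and decoder):

```python
import sentencepiece as spm
from transformers import MBartConfig, MBartForCausalLM, XLMRobertaTokenizer

# 1) Train a SentencePiece model on the single-language corpus.
#    "arabic_corpus.txt" is a hypothetical path, one document per line.
spm.SentencePieceTrainer.train(
    input="arabic_corpus.txt",
    model_prefix="arabic_sp",
    vocab_size=32000,        # illustrative; choose to fit the corpus
    model_type="unigram",
)

# 2) Wrap the trained model in the tokenizer class Donut uses.
tokenizer = XLMRobertaTokenizer(vocab_file="arabic_sp.model")
tokenizer.save_pretrained("./arabic-tokenizer")

# 3) Build an MBart decoder sized to the new vocabulary.
config = MBartConfig(
    vocab_size=len(tokenizer),
    is_decoder=True,
    add_cross_attention=True,  # Donut's decoder cross-attends to the Swin encoder
)
decoder = MBartForCausalLM(config=config)

# 4) Pretrain `decoder` on the same corpus with a causal LM objective,
#    then plug the tokenizer and decoder weights into Donut for fine-tuning.
```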