Open akashlp27 opened 1 year ago
Hey @akashlp27,
would you mind sharing how you trained the MBart tokenizer and MBart decoder on an Arabic corpus? I have wanted to do this for Polish for some time but couldn't figure out how.
If I manage it for Polish, maybe I could also help resolve the NaN issue.
Hey @VictorAtPL, I followed the same procedure mentioned in https://github.com/clovaai/donut/issues/11#issuecomment-1437140288, but with a single-language corpus (Arabic in my case).
Hey, I am trying to train a model to parse Arabic data. As mentioned by @VictorAtPL, I trained the MBart tokenizer on an Arabic corpus and also trained the MBart decoder on the same corpus. The problem occurs during training, at line 128 of the util file:
https://github.com/clovaai/donut/blob/15534e5cd33b524bf323752c24081a15680c80a7/donut/util.py#L128
where the `prompt_end_token_id` and the `pad_token_id` are replaced by -100 (the ignore id). With my custom tokenizer and decoder, however, *all* of the generated decoder ids are replaced by -100, instead of only the `prompt_end_token_id` and `pad_token_id`. Because of this, the loss is `nan`.
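For reference, the masking step can be sketched like this. This is a simplified, list-based rendition of the logic around `donut/util.py` line 128 (the actual code operates on torch tensors), and the function and example ids here are illustrative:

```python
IGNORE_ID = -100  # label positions the cross-entropy loss should skip

def mask_labels(input_ids, pad_token_id, prompt_end_token_id):
    """Schematic label masking: pad tokens, plus everything up to and
    including the prompt-end token, are replaced with IGNORE_ID so the
    model is not trained to predict the prompt or the padding."""
    labels = [IGNORE_ID if t == pad_token_id else t for t in input_ids]
    if prompt_end_token_id in input_ids:
        end = input_ids.index(prompt_end_token_id)
        for i in range(end + 1):
            labels[i] = IGNORE_ID
    return labels

# With matching ids, only the prompt and the padding are ignored:
print(mask_labels([101, 7, 55, 56, 0, 0], pad_token_id=0, prompt_end_token_id=7))
# → [-100, -100, 55, 56, -100, -100]
```

If a retrained tokenizer assigns the special tokens different ids than the collator expects (or maps real tokens onto the pad id), every position can end up as -100; averaging the loss over zero valid tokens then yields `nan`.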
Any suggestions or input would help a lot, @gwkrsrch, @Vadkoz, @NielsRogge. Thanks and regards.
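One debugging check worth trying (illustrative, not from the Donut codebase): verify that the special tokens actually exist in the retrained tokenizer's vocabulary. Many tokenizers resolve an unknown token to the unk id, so a missing prompt-end or pad token can silently break the masking above. All names here are assumptions for the sketch:

```python
def find_missing_special_tokens(vocab, unk_id, special_tokens):
    """Return the special tokens that the (custom) vocab does not contain.

    `vocab` is a token -> id mapping; tokens absent from it are assumed
    to resolve to `unk_id`, which is how many tokenizers behave.
    """
    return [tok for tok in special_tokens if vocab.get(tok, unk_id) == unk_id]

# Toy vocab that lost the prompt-end token during retraining:
vocab = {"<pad>": 1, "</s>": 2, "hello": 10}
print(find_missing_special_tokens(vocab, unk_id=3,
                                  special_tokens=["<pad>", "<s_answer>"]))
# → ['<s_answer>']
```

With a Hugging Face tokenizer you would run the same idea through `tokenizer.get_vocab()` and `tokenizer.unk_token_id`.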