Issue while training Donut model for parsing with custom decoder and tokenizer

Hey all, I am trying to train a Donut model for parsing documents that contain only Arabic text. To do this, I collected an Arabic corpus from various sources and then trained:

An MBart tokenizer on the Arabic corpus.
An MBart decoder on the same dataset.
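Since the tokenizer and the decoder were trained separately, one thing that may be worth verifying is that they agree on vocabulary size and special-token ids. The following is only a minimal sketch: the function name `check_vocab_alignment` is hypothetical, and it assumes the numbers are read by hand from the tokenizer (for example `len(tokenizer)` and `tokenizer.unk_token_id`) and from the decoder config (`vocab_size`).

```python
def check_vocab_alignment(tokenizer_size, decoder_vocab_size, special_ids):
    """Return a list of detected problems; an empty list means no mismatch was found.

    tokenizer_size     : number of tokens the tokenizer knows (assumed: len(tokenizer))
    decoder_vocab_size : size of the decoder's output/embedding vocabulary
    special_ids        : dict such as {"unk": 3, "pad": 1} (illustrative values)
    """
    problems = []

    # The decoder's embedding table must cover every id the tokenizer can emit.
    if tokenizer_size != decoder_vocab_size:
        problems.append(
            f"tokenizer has {tokenizer_size} tokens, decoder expects {decoder_vocab_size}"
        )

    # Every special token needs a defined id that falls inside the decoder's range.
    for name, token_id in special_ids.items():
        if token_id is None:
            problems.append(f"{name} token is not defined")
        elif not 0 <= token_id < decoder_vocab_size:
            problems.append(f"{name} id {token_id} is outside the decoder vocabulary")

    return problems


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the actual run.
    print(check_vocab_alignment(120, 100, {"unk": 3, "pad": 110}))
```

An empty result does not prove the setup is correct; it only rules out the most basic size and id mismatches.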

Initially the model trained well, meaning the loss decreased gradually. During validation, however, every token in my dataset is predicted as <UNK>. Because of this, the Normed ED value stays above 0.9 even though the loss keeps decreasing.
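One possible explanation (an assumption, not something confirmed here) is that the ground-truth text is already mostly <UNK> after tokenization, in which case the loss can fall simply because the model learns to predict <UNK>. The sketch below shows how that could be measured. It assumes labels are available as lists of token ids, and that Normed ED is the edit distance divided by the longer string's length; both helper names are hypothetical.

```python
def unk_fraction(label_ids, unk_id, ignore_ids=()):
    """Fraction of label tokens equal to unk_id, skipping ids in ignore_ids
    (for example padding or a loss-ignore index)."""
    counted = [t for seq in label_ids for t in seq if t not in ignore_ids]
    if not counted:
        return 0.0
    return sum(t == unk_id for t in counted) / len(counted)


def normalized_edit_distance(pred, target):
    """Levenshtein distance between two strings, divided by the longer length."""
    if not pred and not target:
        return 0.0
    # prev[j] holds the distance between the processed prefix of pred and target[:j].
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, t in enumerate(target, 1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(pred), len(target))


if __name__ == "__main__":
    # Illustrative ids: 3 stands for <UNK>, 1 for padding.
    print(unk_fraction([[3, 3, 5], [3, 1]], unk_id=3, ignore_ids=(1,)))
    print(normalized_edit_distance("kitten", "sitting"))
```

If the UNK fraction of the encoded labels is close to 1, the tokenizer (or how it is loaded at training time) would be the first place to look; if it is close to 0, the problem more likely sits in the decoder or in generation.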

Is there anything I am missing? Any input would help a lot. @gwkrsrch, @Vadkoz, @NielsRogge. Thanks and regards.

NielsRogge / Transformers-Tutorials

Issue while training Donut model for parsing with custom decoder and tokenizer #326