NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License

Issue while training Donut model for parsing with custom decoder and tokenizer #326

Open akashlp27 opened 1 year ago

akashlp27 commented 1 year ago

Hey all, I am trying to train a Donut model for document parsing on data that contains only Arabic text. To do this, I collected an Arabic corpus from various sources and then trained:

  1. An MBart tokenizer on the Arabic corpus.
  2. An MBart decoder on the same dataset.

Initially the model seemed to train well, meaning the loss decreased gradually. During validation, however, every token in my dataset is predicted as <UNK>. Because of this, the normalized edit distance (Normed ED) stays above 0.9 even though the loss keeps decreasing.
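One way to narrow this down is to check whether the <UNK> tokens already appear in the encoded ground truth, and whether the tokenizer and the decoder embedding table agree in size. The sketch below is only an illustration in plain Python, not the method used in this repository. The helper names (`unk_fraction`, `out_of_range_ids`, `diagnose`) are hypothetical. With Hugging Face objects, the inputs would presumably come from `tokenizer(text, add_special_tokens=False)["input_ids"]`, `tokenizer.unk_token_id`, `len(tokenizer)`, and `model.decoder.get_input_embeddings().num_embeddings`; the toy numbers here are made up.

```python
def unk_fraction(token_ids, unk_token_id):
    """Return the share of ids equal to the UNK id (0.0 for an empty sequence)."""
    if not token_ids:
        return 0.0
    return sum(1 for t in token_ids if t == unk_token_id) / len(token_ids)


def out_of_range_ids(token_ids, embedding_rows):
    """Return the sorted, de-duplicated ids that have no row in the embedding table."""
    return sorted({t for t in token_ids if t < 0 or t >= embedding_rows})


def diagnose(label_ids, predicted_ids, unk_token_id, tokenizer_size, embedding_rows):
    """Collect a few simple indicators about a tokenizer/decoder pairing.

    label_ids      : ids of a ground-truth sequence encoded by the custom tokenizer
    predicted_ids  : ids generated by the model for the same sample
    tokenizer_size : number of tokens in the tokenizer, added tokens included
    embedding_rows : number of rows in the decoder's input embedding table
    """
    return {
        "label_unk_fraction": unk_fraction(label_ids, unk_token_id),
        "pred_unk_fraction": unk_fraction(predicted_ids, unk_token_id),
        "size_mismatch": tokenizer_size != embedding_rows,
        "label_ids_out_of_range": out_of_range_ids(label_ids, embedding_rows),
    }


# Toy values for illustration only; real ids come from the tokenizer and model.
report = diagnose(
    label_ids=[5, 9, 3, 3, 12],
    predicted_ids=[3, 3, 3, 3, 3],
    unk_token_id=3,
    tokenizer_size=13,
    embedding_rows=10,
)
print(report)
```

A high `label_unk_fraction` would suggest the tokenizer does not cover the ground-truth text. If the labels look fine but the predictions are all <UNK>, a size mismatch or a difference between the tokenizer used to pretrain the decoder and the one used for Donut fine-tuning would be the next thing to look at.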

Is there anything I am missing? Any input would help a lot. @gwkrsrch, @Vadkoz, @NielsRogge Thanks and regards.