Hey all, I am trying to train a Donut model for document parsing on Arabic-only data. To do this, I collected an Arabic corpus from various sources and then trained:
- an MBart tokenizer on the Arabic corpus, and
- the MBart decoder on the same dataset.
Initially the model was training well, with the loss decreasing gradually, but during validation every token in my dataset is predicted as `<UNK>`. Because of this the Normed ED is above 0.9, even though the loss keeps decreasing.
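For reference, this is roughly the sanity check I'm running on the tokenizer and the decoder embeddings. Paths and the ground-truth sample below are placeholders, and I'm assuming the Hugging Face `VisionEncoderDecoderModel` wrapper for Donut here:

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Placeholder paths for my custom Arabic tokenizer and the Donut checkpoint
tokenizer = AutoTokenizer.from_pretrained("path/to/arabic-mbart-tokenizer")
model = VisionEncoderDecoderModel.from_pretrained("path/to/donut-checkpoint")

# 1) Does the tokenizer cover the ground-truth text, or does it fall back to <unk>?
sample = "نص عربي من بيانات التحقق"  # placeholder Arabic ground-truth string
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
unk_ratio = sum(i == tokenizer.unk_token_id for i in ids) / max(len(ids), 1)
print(f"<unk> ratio in ground truth: {unk_ratio:.2%}")
print("round trip:", tokenizer.decode(ids))

# 2) Do the decoder embedding rows match the tokenizer vocabulary size?
emb_rows = model.decoder.get_input_embeddings().weight.shape[0]
print("decoder embedding rows:", emb_rows, "| tokenizer vocab:", len(tokenizer))
# If these disagree, the embeddings would need resizing before training:
# model.decoder.resize_token_embeddings(len(tokenizer))
```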
Is there anything I am missing? Any inputs would help a lot. @gwkrsrch, @Vadkoz, @NielsRogge

Thanks and regards.