clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License
5.75k stars 466 forks

Training Donut for a new language #158

Open Invalid-coder opened 1 year ago

Invalid-coder commented 1 year ago

@josianem @gwkrsrch thank you for the great work!

Could you please help me with Donut pretraining for a new language? I am trying to train a Donut model for Ukrainian text. What advice could you give me regarding the tokenizer and the amount of data?