clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License
5.75k stars 466 forks

[pretrain] read text task data format question #128

Open yysirs opened 1 year ago

yysirs commented 1 year ago

The pretraining read-text task uses English data in this format:

{\"text_sequence\": \"HANDEL SDN BHD\"}

If I use Chinese data, which of the following formats is reasonable?

{\"text_sequence\": \"合 同 号 : 1 2 3 \"}

or

{\"text_sequence\": \"合同号:123\"}

Is the first format suitable, or the second? @logan-markewich @gwkrsrch

logan-markewich commented 1 year ago

I would say use whichever one more closely matches what is written on the document image.

If the text on the image is spaced out like the first Chinese text, then use that one. If not, use the second one.
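To make the advice above concrete, here is a minimal sketch of writing one annotation record per image while keeping the transcription exactly as it appears on the page. It assumes the `metadata.jsonl` layout where `ground_truth` is a JSON string wrapping a `gt_parse` object; the helper name `make_metadata_line` and the file names are hypothetical, so check the layout against your own dataset before relying on it.

```python
import json


def make_metadata_line(file_name: str, text: str) -> str:
    """Build one JSON Lines record; `text` is stored exactly as transcribed.

    Hypothetical helper: the field layout (file_name / ground_truth / gt_parse)
    is an assumption about the dataset format, not taken from this thread.
    """
    gt_parse = {"text_sequence": text}
    # ground_truth is itself a JSON string, which is why the inner quotes
    # show up escaped (\") in the final line.
    ground_truth = json.dumps({"gt_parse": gt_parse}, ensure_ascii=False)
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX.
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth},
        ensure_ascii=False,
    )


# Image where the characters are visibly spaced out: keep the spaces.
spaced = make_metadata_line("contract_spaced.png", "合 同 号 : 1 2 3")
# Image where the characters are printed tightly: no spaces.
compact = make_metadata_line("contract_compact.png", "合同号:123")

print(spaced)
print(compact)
```

The helper does no normalization on purpose: whether spaces belong in `text_sequence` is decided per image by what is actually printed, not by a global rule.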