[pretrain] read text task data format question

clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

https://arxiv.org/abs/2111.15664

MIT License

5.75k stars 466 forks

Open yysirs opened 1 year ago

yysirs commented 1 year ago

For the pretraining text-reading task, the English data looks like this:

{\"text_sequence\": \"HANDEL SDN BHD\"}

If I use Chinese data, which of the following two formats is reasonable?

{\"text_sequence\": \"合 同 号 : 1 2 3 \"}

{\"text_sequence\": \"合同号:123\"}

Is the first one suitable, or the second? @logan-markewich @gwkrsrch

logan-markewich commented 1 year ago

I would say use whichever one more closely matches what is written on the document image.

If the text on the image is spaced out like the first Chinese example, use that one. If not, use the second one.
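For anyone building the file by hand, here is a minimal sketch of how one such line could be written. It assumes the `metadata.jsonl` layout described in the Donut README, where `ground_truth` is itself a JSON string wrapping `gt_parse` (which is why the quotes in the examples above appear escaped). The helper name `make_metadata_line` and the image file names are hypothetical, made up for illustration.

```python
import json


def make_metadata_line(file_name: str, text: str) -> str:
    """Build one metadata.jsonl line for the text-reading pretraining task.

    Hypothetical helper. The layout (file_name + ground_truth holding a
    JSON *string*) is an assumption based on the Donut README.
    """
    # Inner JSON: serialized to a string first, so its quotes get escaped
    # when the outer record is serialized.
    ground_truth = json.dumps(
        {"gt_parse": {"text_sequence": text}},
        ensure_ascii=False,  # keep Chinese characters readable, not \uXXXX
    )
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth},
        ensure_ascii=False,
    )


# Use the text exactly as it appears on the image:
# compact if the characters are printed tightly, spaced if they are spread out.
compact = make_metadata_line("contract_0001.jpg", "合同号:123")
spaced = make_metadata_line("contract_0002.jpg", "合 同 号 : 1 2 3")

print(compact)
print(spaced)
```

Either string round-trips through `json.loads`, so the choice between the two forms only depends on what is actually written on the document image.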