clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Two types of documents in one model? #256

Open henkish opened 9 months ago

henkish commented 9 months ago

Hello!

We are trying to read data from both electronic invoices (typically well-structured PDF files with clean, complete information) and cashier receipts (like the ones in the CORD dataset).

What would be the best approach to handle both types of documents? My current approach is to train three models: a document classifier, an invoice parser, and a cashier receipt parser. I would run the document classifier first and then use its result to decide which parser to run next.

I am wondering whether I could combine everything into one model. Invoices have some additional fields (a due date, for instance), but apart from those, all fields are the same. Would it be possible, for instance, to add a "class" field to the data and then train a single model on all documents?
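To make the question concrete, here is a minimal sketch of what such combined training data could look like. It assumes the `metadata.jsonl` layout from the Donut README (one JSON object per line with `file_name` and a `ground_truth` string wrapping `gt_parse`). The file names, the field names (`total`, `due_date`) and the helper `make_record` are made up for illustration.

```python
import json


def make_record(file_name, doc_class, fields):
    """Build one metadata.jsonl entry with a "class" field next to the parsed fields.

    Hypothetical helper: the field names are illustrative, not a fixed schema.
    """
    gt_parse = {"class": doc_class, **fields}
    return {
        "file_name": file_name,
        # ground_truth is stored as a JSON string, not as a nested object
        "ground_truth": json.dumps({"gt_parse": gt_parse}),
    }


records = [
    # Invoices carry the extra due_date field
    make_record("invoice_0001.png", "invoice", {"total": "120.00", "due_date": "2023-05-01"}),
    # Receipts share the common fields but have no due_date
    make_record("receipt_0001.png", "receipt", {"total": "8.40"}),
]

# One JSON object per line, ready to be written to metadata.jsonl
lines = [json.dumps(record) for record in records]
```

Both document types then live in the same file, and the only thing that tells them apart is the value of `class` and which optional fields are present.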

felixvor commented 5 months ago

That should definitely be possible. Donut simply trains to generate JSON output, regardless of whether the JSON contains class names or extracted values. The only difference for classification is that the class names are added to the tokenizer as their own special tokens, so the model can learn each class name from scratch instead of piecing it together from existing subword tokens.
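As a rough illustration of that point, the sketch below flattens a ground-truth dict into a target token sequence. It is a simplified stand-in written for this thread, not the repository's actual conversion code: `json_to_tokens`, `class_tokens` and `special_tokens` are hypothetical names, and the `<s_key>...</s_key>` and `<value/>` token shapes are my understanding of the convention Donut uses.

```python
def json_to_tokens(obj, class_tokens, special_tokens):
    """Flatten a ground-truth object into a single target string.

    class_tokens: set of categorical tokens such as "<invoice/>" (assumed
        to be registered with the tokenizer as special tokens).
    special_tokens: set that collects the field tokens seen along the way.
    """
    if isinstance(obj, dict):
        out = ""
        for key, value in obj.items():
            start, end = f"<s_{key}>", f"</s_{key}>"
            special_tokens.update({start, end})
            out += start + json_to_tokens(value, class_tokens, special_tokens) + end
        return out
    if isinstance(obj, list):
        # Repeated items are joined with a separator token
        return "<sep/>".join(
            json_to_tokens(item, class_tokens, special_tokens) for item in obj
        )
    text = str(obj)
    # A known class name becomes one special token instead of free text
    if f"<{text}/>" in class_tokens:
        return f"<{text}/>"
    return text


class_tokens = {"<invoice/>", "<receipt/>"}
special_tokens = set()
sequence = json_to_tokens(
    {"class": "invoice", "total": "120.00"}, class_tokens, special_tokens
)
# sequence == "<s_class><invoice/></s_class><s_total>120.00</s_total>"
```

The class value is emitted as the single token `<invoice/>`, while `120.00` stays ordinary text that the decoder spells out with its existing vocabulary.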

Did you experiment further with this? Did you observe any meaningful performance difference between the dedicated and combined models?