clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

How to handle multi-page invoices #279

Open DNFSiF opened 6 months ago

DNFSiF commented 6 months ago

I am trying to use Donut with a custom invoice dataset to extract fields such as invoice numbers and totals.
The invoices can be single-page or multi-page, so the fields may be spread across different pages.

Does anyone have experience with multi-page invoices?
Should I merge the pages into a single image?
Should I train different models for different page counts?

Thanks for any advice! 😄

felixvor commented 5 months ago

You could think about increasing the input dimensions and forwarding multiple pages as one image, but that does not scale well: no realistic hardware can handle the compute once you go beyond a few pages. What we did instead was assign each value we want to label to a page via fuzzy matching against that page's OCR text (for example with a library like rapidfuzz). If a label value shows up as an approximate substring of a page's OCR, we attach that label to the page for Donut training. Maybe that helps you, good luck!
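A minimal sketch of that page-labeling idea, assuming you already have OCR text per page and the annotated field values for the whole invoice; `rapidfuzz.fuzz.partial_ratio` and the threshold of 90 are just one possible choice:

```python
from rapidfuzz import fuzz


def label_pages(page_ocr_texts: list[str], field_values: dict[str, str],
                threshold: float = 90.0) -> dict[int, dict[str, str]]:
    """Assign each annotated field value to the page(s) whose OCR fuzzily contains it."""
    labels_per_page: dict[int, dict[str, str]] = {i: {} for i in range(len(page_ocr_texts))}
    for field, value in field_values.items():
        for page_idx, text in enumerate(page_ocr_texts):
            # partial_ratio scores the best matching substring (0-100),
            # which tolerates OCR noise in the page text.
            if fuzz.partial_ratio(str(value), text) >= threshold:
                labels_per_page[page_idx][field] = value
    return labels_per_page


# Example: {"invoice_number": "INV-2023-001"} lands on the page whose OCR contains it,
# and each page then becomes its own single-page Donut training sample.
```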

balajiChundi commented 5 months ago

"Sending in multiple pages for each request", if you define your use case like this - model's max_positional_embeddings (you might have to parameter tune) might not be sufficient to incorporate all the info in a single response and higher possibilities of repetition of text. Instead, you can build a single page prediction model at a time and handle the predictions later.

xdevfaheem commented 4 months ago

@balajiChundi can you elaborate a bit on what you mean?

balajiChundi commented 2 months ago

First (and preferred) way: get predictions from the model once per page (so twice for a two-page invoice). You can parallelize the per-page predictions for a faster output; see the sketch below. PS: this worked for me.

Second way (didn't work for me): I concatenated the images by stitching them vertically and trained the model on that. The problem is that data preparation becomes very clumsy and time-consuming, and you cannot really decide on the max_token to allow for the output, so this is not recommended at all.
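A minimal sketch of the first option (one prediction per page, merged afterwards), using the Hugging Face transformers Donut API rather than this repo's DonutModel class; the checkpoint name, task prompt, and the "keep the first non-empty value per field" merge rule are assumptions you would replace with your own fine-tuned model and logic:

```python
import re
from concurrent.futures import ThreadPoolExecutor

from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

CHECKPOINT = "naver-clova-ix/donut-base-finetuned-cord-v2"  # replace with your fine-tuned model
processor = DonutProcessor.from_pretrained(CHECKPOINT)
model = VisionEncoderDecoderModel.from_pretrained(CHECKPOINT)
model.eval()


def predict_page(image: Image.Image) -> dict:
    """Run a single-page Donut prediction and return the parsed JSON."""
    task_prompt = "<s_cord-v2>"  # replace with your own task start token
    pixel_values = processor(image, return_tensors="pt").pixel_values
    decoder_input_ids = processor.tokenizer(
        task_prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
    )
    sequence = processor.batch_decode(outputs)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "")
    sequence = sequence.replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt token
    return processor.token2json(sequence)


def predict_invoice(pages: list[Image.Image]) -> dict:
    """Predict each page independently, then merge the per-page results."""
    # Pages are independent, so they can run concurrently (threads shown here;
    # separate processes or GPUs scale better for real workloads).
    with ThreadPoolExecutor(max_workers=2) as pool:
        per_page = list(pool.map(predict_page, pages))
    # Naive merge rule: keep the first non-empty value seen for each field.
    merged: dict = {}
    for result in per_page:
        for key, value in result.items():
            merged.setdefault(key, value)
    return merged
```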