clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Document Information Extraction in Production - How to know the confidence of each field? #101

Open WaterKnight1998 opened 2 years ago

WaterKnight1998 commented 2 years ago

Good afternoon, thank you very much for the incredible work!

I just wanted to know if there is an automatic way to tell whether an extracted field is correct, especially when other applications consume the data extracted by the model. That way we could reduce the amount of human-in-the-loop review as much as possible.

Other architectures like LayoutLM return a confidence for each field. I know it is difficult to get per-field confidences with this architecture. Would it be doable to apply some validations to check whether a prediction was properly extracted (a rough sketch of what such checks could look like follows below)? Do you have other ideas?

Thanks in advance!
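As a rough illustration of the kind of validations asked about above, here is a minimal sketch of rule-based checks on Donut's parsed output. The field names ("total_price", "date") and the patterns are hypothetical placeholders, not part of Donut's API; they would need to match your own document schema.

```python
import re

# Hypothetical per-field validation rules; adapt to your own schema.
VALIDATORS = {
    "total_price": re.compile(r"^\d+([.,]\d{1,2})?$"),  # e.g. "12.50"
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),          # e.g. "2022-07-01"
}

def validate_fields(parsed: dict) -> dict:
    """Return a field -> bool map saying whether each value passes its rule."""
    results = {}
    for field, pattern in VALIDATORS.items():
        value = parsed.get(field)
        results[field] = bool(value) and bool(pattern.match(str(value)))
    return results

# Example: route only documents with failing fields back to a human reviewer.
parsed = {"total_price": "12.50", "date": "2022/07/01"}
print(validate_fields(parsed))  # {'total_price': True, 'date': False}
```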

TheSeriousProgrammer commented 2 years ago

This issue briefly discusses the same topic: https://github.com/clovaai/donut/issues/37. However, it gives a confidence score for the whole JSON, not for individual entities. This model predicts the whole JSON as an XML-like string (just like GPT), so I am not sure how feasible it is to extract confidence values for specific fields. One could extract the confidences of the individual tokens in the value fields and derive field-level confidence scores from them (this should work in theory but may take a lot of trial and error).
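As a sketch of what extracting per-token confidences could look like, assuming the Hugging Face export of Donut (DonutProcessor plus VisionEncoderDecoderModel, e.g. "naver-clova-ix/donut-base-finetuned-cord-v2"): passing `output_scores=True` and `return_dict_in_generate=True` to `generate()` keeps the logits of every decoding step, from which the probability of each chosen token can be read off. The checkpoint name, task prompt, and image path are placeholders; the official repo's DonutModel wraps the same kind of decoder, so the idea carries over even though the exact calls differ.

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"  # assumed checkpoint
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("document.png").convert("RGB")  # hypothetical input file
pixel_values = processor(image, return_tensors="pt").pixel_values
task_prompt = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=task_prompt,
    max_length=model.decoder.config.max_position_embeddings,
    eos_token_id=processor.tokenizer.eos_token_id,
    pad_token_id=processor.tokenizer.pad_token_id,
    return_dict_in_generate=True,
    output_scores=True,  # keep the logits of every generated step
)

# outputs.scores holds one logits tensor per generated step (greedy decoding);
# softmax each step and pick the probability of the token that was chosen.
generated = outputs.sequences[0, task_prompt.shape[1]:]
token_probs = []
for step, token_id in enumerate(generated):
    probs = torch.softmax(outputs.scores[step][0], dim=-1)
    token_probs.append(probs[token_id].item())

tokens = processor.tokenizer.convert_ids_to_tokens(generated)
for tok, p in zip(tokens, token_probs):
    print(f"{tok!r}: {p:.3f}")
```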

Searching the existing GitHub issues with relevant keywords may also help.

But please be polite when asking such questions; the clovaai team has done a phenomenal job and has made it open source. If you can't be supportive, at least try not to be harsh in your comments 😊

WaterKnight1998 commented 2 years ago

> But please be polite when asking such questions; the clovaai team has done a phenomenal job and has made it open source. If you can't be supportive, at least try not to be harsh in your comments 😊

Yes, this model is awesome; they did an incredible job. I didn't mean to be harsh.

> I just wanted to know if there is an automatic way to tell whether an extracted field is correct, especially when other applications consume the data extracted by the model. That way we could reduce the amount of human-in-the-loop review as much as possible.

If you suggest a better title, I can update it. I have updated the first message; let me know if it is OK.

TheSeriousProgrammer commented 2 years ago

Thanks for updating your comment, it looks fine now 🙌

As of now there is no direct method to extract confidence scores for specific fields. But every predicted token has a confidence score, so in theory it should be possible to extract the individual token confidences within a given value field and average them to get a confidence score for that value.
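Here is a minimal sketch of that averaging idea, assuming you already have the generated token strings and their probabilities (e.g. from the per-token snippet earlier in this thread) and that the output follows Donut's `<s_field> ... </s_field>` tag convention. The toy tokens and probabilities at the bottom are made up for illustration.

```python
import re
from collections import defaultdict

def field_confidences(tokens, token_probs):
    """Average the per-token probabilities of the tokens that fall inside
    each <s_field> ... </s_field> span of Donut's output sequence."""
    per_field = defaultdict(list)
    stack = []  # currently open field tags, innermost last
    for tok, prob in zip(tokens, token_probs):
        open_tag = re.fullmatch(r"<s_(.+?)>", tok)
        close_tag = re.fullmatch(r"</s_(.+?)>", tok)
        if open_tag:
            stack.append(open_tag.group(1))
        elif close_tag:
            if stack and stack[-1] == close_tag.group(1):
                stack.pop()
        elif stack:
            # value token: attribute it to the innermost open field
            per_field[stack[-1]].append(prob)
    return {field: sum(p) / len(p) for field, p in per_field.items() if p}

# Hypothetical toy sequence: "12.50" inside <s_total_price> ... </s_total_price>
tokens = ["<s_total_price>", "▁12", ".", "50", "</s_total_price>"]
probs = [0.99, 0.97, 0.95, 0.60, 0.99]
print(field_confidences(tokens, probs))  # {'total_price': ≈0.84}
```

A per-field score like this could then be thresholded to decide which fields go to human review, keeping in mind the caveat below about high confidence not guaranteeing correctness.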

But there is no guarantee that the resulting confidence score actually conveys the correctness of a given value (i.e. it may assign high confidence to values that are syntactically plausible but logically incorrect).

Let's wait for the authors' input on this.