Hello @bogdankostic, and sorry for the intrusion. During the assessment, you might also want to have a look at Donut. It looks interesting, even if I don't know how mature it is...
LayoutLM models work well for invoices but not for documents in general. I worked on a similar use case and used DiT, but I found that PaddleOCR's layoutparser model works better and faster for structure recognition. I used bounding boxes to compare and map text to layout boxes. Happy to help with this feature!
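For anyone wondering what the bounding-box mapping mentioned above could look like, here is a minimal sketch. It does not reproduce the commenter's actual code or any PaddleOCR API: the `Box` class, the `assign_text_to_layout` function, and the `min_overlap` threshold are all hypothetical names chosen for illustration. It assumes both the OCR step and the layout model return axis-aligned boxes in the same pixel coordinates, and it assigns each text box to the layout region that covers the largest share of its area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes, or 0.0 if they do not overlap."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(0.0, width) * max(0.0, height)


def assign_text_to_layout(text_boxes, layout_boxes, min_overlap=0.5):
    """Map OCR text boxes to layout regions.

    text_boxes:   list of (text, Box) pairs from an OCR step.
    layout_boxes: list of (label, Box) pairs from a layout model.
    min_overlap:  assumed threshold; the fraction of a text box's area
                  that must lie inside a region for it to be assigned.

    Returns (assigned, unassigned), where assigned[i] holds the texts
    mapped to layout_boxes[i] in input order.
    """
    assigned = [[] for _ in layout_boxes]
    unassigned = []
    for text, tbox in text_boxes:
        text_area = tbox.area()
        best_index, best_ratio = None, 0.0
        if text_area > 0:
            for i, (_, lbox) in enumerate(layout_boxes):
                ratio = intersection_area(tbox, lbox) / text_area
                if ratio > best_ratio:
                    best_index, best_ratio = i, ratio
        if best_index is not None and best_ratio >= min_overlap:
            assigned[best_index].append(text)
        else:
            unassigned.append(text)
    return assigned, unassigned


layout = [("title", Box(0, 0, 100, 20)), ("paragraph", Box(0, 30, 100, 100))]
words = [
    ("Invoice", Box(5, 2, 40, 18)),
    ("Total", Box(5, 40, 30, 55)),
    ("stray", Box(200, 200, 220, 210)),
]
assigned, unassigned = assign_text_to_layout(words, layout)
for (label, _), texts in zip(layout, assigned):
    print(label, texts)
print("unassigned", unassigned)
```

Normalizing by the text box's area rather than using plain IoU is a design choice for this sketch: a small word box sitting fully inside a large paragraph region has a low IoU but should still be assigned to that region.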
Having submitted #1404 in 2021, I was excited to see some movement on this topic!
Note that this subfield has moved quickly. If you're still evaluating transformer models for this task, I think UDOP looks to be the most promising recent model and will hopefully be on HuggingFace soon: https://github.com/huggingface/transformers/issues/20650. Unfortunately, the Microsoft team that trained the model says in their repo that "Due to fake document generation ethical consideration, we plan to release this functionality as an Azure API", so I guess model weights will have to come from elsewhere...
Hi @hammer, thanks for your interest in UDOP. We've released the encoder + text decoder model weights at https://huggingface.co/ZinengTang/Udop. By "Due to fake document...", we mean that we need to release the vision decoding (i.e. document image generation functionality) in a more responsible way with ethical consideration.
@bogdankostic @bglearning Could you share an update on Document VQA here? I know you briefly worked on it and did some research recently. 🙂
Meta AI released Nougat; its current codebase is built on top of Donut. It looks promising, though it is mostly optimized for 'scientific' documents...
Is your feature request related to a problem? Please describe. LayoutLM is a transformer-based model that is able to take PDFs as input and perform different tasks on them. We should assess whether we can use LayoutLM to convert PDF files to Documents. For this, we should check whether a suitable fine-tuned model already exists. If not, it might be necessary to fine-tune a new one for our needs.
One dataset that might be interesting for fine-tuning is DocLayNet, a dataset consisting of a variety of different PDFs labeled with regard to their layout.