Assess usage of LayoutLM for extracting structural elements of PDFs

deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

https://haystack.deepset.ai

Apache License 2.0


Assess usage of LayoutLM for extracting structural elements of PDFs #3058

Closed bogdankostic closed 6 months ago

bogdankostic commented 2 years ago

Is your feature request related to a problem? Please describe. LayoutLM is a transformer-based model that is able to take PDFs as input and perform different tasks on them. We should assess whether we can use LayoutLM to convert PDF files to Documents. For this, we should check whether a suitable fine-tuned model already exists. If not, it might be necessary to fine-tune a new one for our needs.

One dataset that might be interesting for fine-tuning is DocLayNet, a dataset consisting of a variety of different PDFs labeled with regard to their layout.

anakin87 commented 2 years ago

Hello @bogdankostic and sorry for the intrusion. During the assessment, you might as well have a look at Donut. It looks interesting, even if I don't know how mature it is...

0-hero commented 2 years ago

LayoutLM models work well for invoices but not for documents in general. I worked on a similar use case and used DiT, but I found that PaddleOCR's layoutparser model works better and faster for structure recognition. I used bounding boxes to compare and map text to layout boxes. Happy to help with this feature!!
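For readers unfamiliar with the bounding-box mapping step mentioned above, here is a minimal sketch of one way it could work. It is not the commenter's code and does not use the PaddleOCR API. It assumes you already have layout regions (label plus box) from some layout model and text snippets (text plus box) from some OCR step, all in the same `(x0, y0, x1, y1)` coordinate system. The names `LayoutRegion`, `overlap_ratio`, `assign_text_to_regions`, and the `min_overlap` threshold of 0.5 are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Box = Tuple[float, float, float, float]


@dataclass
class LayoutRegion:
    """Hypothetical container for one detected layout element."""
    label: str
    box: Box
    texts: List[str] = field(default_factory=list)


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Return the fraction of `inner`'s area that lies inside `outer`."""
    ix0, iy0 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix1, iy1 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def assign_text_to_regions(
    words: List[Tuple[str, Box]],
    regions: List[LayoutRegion],
    min_overlap: float = 0.5,  # assumed threshold, tune for your data
) -> List[str]:
    """Attach each text snippet to the region covering most of its box.

    Snippets that no region covers by at least `min_overlap` are returned.
    """
    unassigned = []
    for text, box in words:
        best = max(regions, key=lambda r: overlap_ratio(box, r.box), default=None)
        if best is not None and overlap_ratio(box, best.box) >= min_overlap:
            best.texts.append(text)
        else:
            unassigned.append(text)
    return unassigned


# Toy input: two regions and four OCR snippets (made-up coordinates).
regions = [
    LayoutRegion("Title", (0, 0, 100, 20)),
    LayoutRegion("Text", (0, 30, 100, 90)),
]
words = [
    ("Annual", (5, 5, 30, 15)),
    ("Report", (35, 5, 60, 15)),
    ("Revenue", (5, 40, 40, 50)),
    ("7", (95, 95, 99, 99)),  # e.g. a page number outside every region
]
leftover = assign_text_to_regions(words, regions)
```

Measuring overlap relative to the text box, rather than using intersection over union, is a deliberate choice in this sketch: a single word is much smaller than a paragraph region, so its IoU with the region would be tiny even when the word lies entirely inside it.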

anakin87 commented 2 years ago

Interesting HF LinkedIn post about Donut fine-tuned for Question Answering.

hammer commented 1 year ago

Having submitted #1404 in 2021, I was excited to see some movement on this topic!

Note that this subfield has moved quickly. If you're still evaluating transformer models for this task, I think UDOP looks to be the most promising recent model and will hopefully be on HuggingFace soon: https://github.com/huggingface/transformers/issues/20650. Unfortunately, the Microsoft team that trained the model says in their repo that "Due to fake document generation ethical consideration, we plan to release this functionality as an Azure API", so I guess model weights will have to come from elsewhere...

ziyi-yang commented 1 year ago

Hi @hammer, thanks for your interest in UDOP. We've released the encoder + text decoder model weights at https://huggingface.co/ZinengTang/Udop. By "Due to fake document...", we mean that we need to release the vision decoding (i.e., the document image generation functionality) in a more responsible way, with ethical considerations in mind.

julian-risch commented 1 year ago

@bogdankostic @bglearning Could you share an update on Document VQA here? I know you briefly worked on it and did some research recently. 🙂

PAHXO commented 1 year ago

> Hello @bogdankostic and sorry for the intrusion. During the assessment, you might as well have a look at Donut. It looks interesting, even if I don't know how mature it is...

Meta AI released Nougat; its current codebase is built on top of Donut. It looks promising, though it is mostly optimized for 'scientific' documents...