Layout-Parser / layout-parser

A Unified Toolkit for Deep Learning Based Document Image Analysis
https://layout-parser.github.io/
Apache License 2.0

Multi-modal approach to LP's Deep Layout Parsing capability #49

Open nasheedyasin opened 3 years ago

nasheedyasin commented 3 years ago

Motivation: When it comes to layout parsing of forms and other such structured documents, I have noticed that relying only on the image features of a region of interest can lead to quite a few false positives. If we could take a multimodal approach that also considers the text present within these regions, forming a richer representation, we could considerably improve performance over the existing pure object-detection methodology.

Of course, this is relevant only for structured documents like forms and invoices, but I'm guessing that a vast majority of your users, much like myself, would be interested in such a feature.
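To make the idea concrete, here is a minimal sketch of what such a two-stage pipeline could look like, built on layout-parser's existing `Detectron2LayoutModel` and `TesseractAgent` APIs. The keyword rule at the end is a hypothetical stand-in for a learned text encoder, and the file name and keywords are placeholders:

```python
# Sketch only: the keyword matching below stands in for a real
# multimodal (text + vision) scoring model.
import layoutparser as lp
import cv2

image = cv2.imread("form.png")   # hypothetical input form image
image = image[..., ::-1]         # BGR -> RGB for layout-parser

# Stage 1: pure object detection, as layout-parser does today.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
)
layout = model.detect(image)

# Stage 2: attach the text found inside each detected region.
ocr_agent = lp.TesseractAgent(languages="eng")
for block in layout:
    segment = block.pad(left=5, right=5, top=5, bottom=5).crop_image(image)
    block.set(text=ocr_agent.detect(segment), inplace=True)

# Stage 3 (the multimodal part): re-score regions using their text.
# A trivial keyword rule illustrates where a text encoder would go.
def looks_like_form_field(block):
    return any(kw in block.text.lower() for kw in ("name", "date", "address"))

fields = [b for b in layout if looks_like_form_field(b)]
```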

PS: Would love to work on developing such a feature with you all.

For reference: a form like this.

@lolipopshock

lolipopshock commented 3 years ago

Thanks! As mentioned in the layout-parser paper, this is the direction we are working on right now. I'll share more information in this thread when there are updates.

alejandrojcastaneira commented 2 years ago

Hello everyone, any update on this front? For some documents it is really impossible to identify the correct layout without incorporating the semantic context of the text.

nasheedyasin commented 2 years ago

> Hello everyone, any update on this front? For some documents it is really impossible to identify the correct layout without incorporating the semantic context of the text.

The best way forward would be to integrate Hugging Face models like LayoutLMv3 into the Layout Parser ecosystem. I believe there has been some work in this direction; @lolipopshock will be able to tell you more.
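For anyone who wants to experiment before official support lands, here is a rough sketch of running LayoutLMv3 directly through the `transformers` library. The checkpoint name is real, but the label count and image path are placeholders, and wiring the output back into Layout Parser's `TextBlock` structures is left open:

```python
# Sketch only: assumes `transformers`, `Pillow`, and `pytesseract` are
# installed (the processor runs Tesseract OCR itself by default).
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=5,  # placeholder size for a form-field label set
)

image = Image.open("form.png").convert("RGB")  # hypothetical input

# The processor OCRs the page and builds token + bounding-box inputs.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)  # one predicted label per token
```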