Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

How to distinguish tables and figures #28244

Closed · dunalduck0 closed this issue 1 year ago

dunalduck0 commented 1 year ago

I am using prebuilt-layout to extract tables from PDF papers. In this example paper (link), the model mistook Fig 3 on page 5 for a table (a snapshot of the figure is attached at the end).

My question is two-fold:

  1. Is there a built-in way to recognize figures and therefore filter them out?
  2. If the answer is no, I want to leverage the surrounding text (e.g. "Fig 3" or "Table 2") to distinguish tables from figures. I would like to understand the data field bounding_regions.polygon: it has four X-Y points. What are these four points for table/figure objects, and what is the unit?

[Attached image: snapshot of the figure on page 5 that was recognized as a table]

xiangyan99 commented 1 year ago

Thanks for reaching out.

Could you tell us which library and version you are using?

dunalduck0 commented 1 year ago

Hi @xiangyan99, I am using azure-ai-formrecognizer==3.2.0

catalinaperalta commented 1 year ago

Thanks for the questions @dunalduck0! There isn't currently an option to enable/disable specifically recognizing figures with prebuilt-layout. Tagging @vkurpad from the service side to provide more insight here.

As for your second question, you can use the properties on the bounding region to correlate other recognized content that falls in the area you want to search. The points of the polygon outline the specific component; for instance, the points of a table's bounding region outline the recognized table in the document. The unit depends on whether the input is an image or a PDF: for images the unit is pixels, and for PDFs it is inches. Here is the definition of the polygon on a bounding region:

        A list of points representing the bounding polygon
        that outlines the document component. The points are listed in
        clockwise order relative to the document component orientation
        starting from the top-left.
        Units are in pixels for images and inches for PDF.
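
For reference, here's a minimal sketch (not an official sample) of reading the table polygons with azure-ai-formrecognizer 3.2.0; the endpoint, key, and file name are placeholders to substitute with your own:

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # Placeholder endpoint/key for your Form Recognizer resource
    client = DocumentAnalysisClient("<endpoint>", AzureKeyCredential("<key>"))

    # "paper.pdf" stands in for the PDF being analyzed
    with open("paper.pdf", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", f)
    result = poller.result()

    for i, table in enumerate(result.tables):
        for region in table.bounding_regions:
            # region.polygon is a list of Point(x, y), clockwise from the top-left;
            # units are inches for PDFs and pixels for images.
            points = [(p.x, p.y) for p in region.polygon]
            print(f"Table {i} on page {region.page_number}: {points}")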
ghost commented 1 year ago

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

dunalduck0 commented 1 year ago

Thank you @catalinaperalta for the answer. I was able to eliminate figures by checking whether the nearest text (either above or below) starts with "Figure" or "Fig". I hope this will work for most well-written papers.
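
For illustration, a rough sketch of that caption heuristic (a hypothetical helper, not the exact code used here; it assumes the `result` object from the earlier snippet and that `result.paragraphs` carries the caption text):

    def looks_like_figure(table, paragraphs):
        # Heuristic: treat the table as a figure if the nearest paragraph
        # above or below it (on the same page) starts with "Fig".
        region = table.bounding_regions[0]
        top = min(p.y for p in region.polygon)
        bottom = max(p.y for p in region.polygon)

        nearest_text, nearest_dist = None, float("inf")
        for para in paragraphs:
            for pr in para.bounding_regions:
                if pr.page_number != region.page_number:
                    continue
                p_top = min(pt.y for pt in pr.polygon)
                p_bottom = max(pt.y for pt in pr.polygon)
                # Vertical distance from the paragraph to the table edge above or below it
                dist = min(abs(top - p_bottom), abs(p_top - bottom))
                if dist < nearest_dist:
                    nearest_dist, nearest_text = dist, para.content

        # "Fig" also covers captions that start with "Figure"
        return bool(nearest_text) and nearest_text.lstrip().startswith("Fig")

    tables = [t for t in result.tables if not looks_like_figure(t, result.paragraphs)]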

I have 3 additional questions about table extraction quality.

  1. Many tables in the papers I work with contain both column and row headers, but the package seems to recognize only column headers: the extraction contains only two cell kinds, columnHeader and content, and never rowHeader (a snippet for inspecting cell kinds follows after this list).
  2. Often the row/column headers are nested, and the extraction struggles to understand them. An example is attached below. In the output (the columnHeader and content prefixes were added artificially), the nested column headers are sometimes merged into a single column header (columns 2, 3, 4, 5) and sometimes recognized correctly (6, 7, 8, 9). Can anything be done to improve this?

[Attached image: original table in the paper]

[Attached image: Form Recognizer table extraction]

  3. Special symbols, superscripts, and subscripts are lost in the extracted data (see the same example above). Excel was able to preserve this information (see below), though it also struggled with the nested headers.

[Attached image: Excel extraction of the same table]
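
For reference, a quick (hypothetical) snippet to see which cell kinds the service returned for a table; it assumes the `result` object from the earlier layout call, and possible kinds include "content", "columnHeader", and, when detected, "rowHeader":

    table = result.tables[0]  # pick the table of interest
    for cell in table.cells:
        print(cell.row_index, cell.column_index, cell.kind, repr(cell.content))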

catalinaperalta commented 1 year ago

Glad to help @dunalduck0! These are good questions; it seems that the prebuilt-layout algorithm is not recognizing all of the elements you're looking for in this set of documents. A custom model might help improve recognition for your specific documents. @vkurpad, should prebuilt-layout be able to return some of these content elements (such as rowHeader in addition to columnHeader, nested headers, and the special symbols)?

bojunehsu commented 1 year ago

Hi @dunalduck0,

We are constantly improving our underlying table extraction algorithm. I was able to get the correct nested column headers via https://formrecognizer.appliedai.azure.com/studio/layout (except for 2 missed header texts). Can you try again?

[Attached image: Studio layout result showing the nested column headers]

We do return rowHeader as a cell kind in certain cases. But in this particular table, with no visual indication, it is subjective whether the Analog column is a rowHeader; I personally would not label it as such.

The service does not yet support the recognition of super/subscripts, or mathematical formulas in general.

vkurpad commented 1 year ago

I tried the same image in the Studio and got the same result shared by @bojunehsu. Could you try updating to the latest SDK version?

There are a few planned updates that should improve the issues with mathematical formulas.

ghost commented 1 year ago

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!