Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.94k stars, 733 forks

Misclassification of element types on ADV forms #2541

Open lavish2210 opened 8 months ago

lavish2210 commented 8 months ago

I am using the hi_res model locally and have tried it both with and without chunking. I also tried the chipper model via the API, but faced similar issues.

The major issues we faced while trying it on ADV brochures:

  1. Classification Issue - In some cases a title and its corresponding text are classified as a single token, and this whole text has its parent pointing to the header of the page. For example, the following image is a snippet from page 2 of the Blackrock PDF (https://files.adviserinfo.sec.gov/IAPD/Content/Common/crd_iapd_Brochure.aspx?BRCHR_VRSN_ID=848663).

image

In the above snippet, the text "Item 2. Material Changes Since the last annual update to the Form ADV Part 2A (the “Brochure”) on March 31, 2022, material changes to this Brochure include amendments to the following items:" is classified as a single narrative text element. Ideally, the heading "Item 2. Material Changes" should be classified as a title, separate from the text that follows it.
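As a possible stopgap for the case above, a merged element could be split in post-processing. The following is a minimal sketch, not part of unstructured: `KNOWN_HEADINGS` and `split_merged_heading` are hypothetical names, and the heading list is an assumption that would need to be filled in for real brochures (only "Material Changes" is taken from the snippet above).

```python
import re

# Hypothetical, incomplete list of headings. Only "Material Changes"
# comes from the snippet above; extend it for real documents.
KNOWN_HEADINGS = ["Material Changes"]

_alternatives = "|".join(re.escape(h) for h in KNOWN_HEADINGS)

# "Item <number>." followed by a known heading, then the body text.
_HEADING_RE = re.compile(
    rf"^(Item\s+\d+\.?\s+(?:{_alternatives}))\s+(.+)$",
    re.DOTALL,
)


def split_merged_heading(text):
    """Return (title, body).

    title is None when the text does not start with a known heading
    followed by further text, in which case body is the input unchanged.
    """
    match = _HEADING_RE.match(text.strip())
    if match is None:
        return None, text
    return match.group(1), match.group(2).strip()


merged = (
    "Item 2. Material Changes Since the last annual update "
    "to the Form ADV Part 2A"
)
title, body = split_merged_heading(merged)
print(title)
print(body)
```

A fixed heading list is used here because a purely pattern-based split cannot tell where the heading ends: "Since" is capitalized just like the heading words before it.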

  2. Table Extraction Issue - The following snippet is taken from page 24 of the Blackrock PDF (linked in issue 1).

image

We did not receive the correct table structure for the above table.

  3. Multicolumn documents - We are not able to get the correct structure for multicolumn PDFs. The right column is recognized first, and then the left column (and that too row by row). Ideally, the whole left column should be recognized first, followed by the whole right column. Example: https://files.adviserinfo.sec.gov/IAPD/Content/Common/crd_iapd_Brochure.aspx?BRCHR_VRSN_ID=821958

  4. Chunking issue - Following on from issue 1, if the text is not correctly classified as a title, chunking does not work correctly either.
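To make the expected reading order for the multicolumn case concrete, here is a minimal sketch of the ordering we would like to see. It is an illustration only, not how unstructured orders elements: the `Box` class and `two_column_order` function are hypothetical, and the sketch assumes exactly two columns split at the page midline, no full-width elements, and a y coordinate that grows downward.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a detected text block."""
    text: str
    x: float  # left edge of the block
    y: float  # top edge of the block (assumed to grow downward)


def two_column_order(boxes, page_width):
    """Order blocks as: whole left column top to bottom, then whole right column."""
    mid = page_width / 2
    left = sorted((b for b in boxes if b.x < mid), key=lambda b: b.y)
    right = sorted((b for b in boxes if b.x >= mid), key=lambda b: b.y)
    return left + right


# Blocks in the order described above: right column first, then left.
detected = [
    Box("R1", 320, 100),
    Box("R2", 320, 200),
    Box("L1", 50, 100),
    Box("L2", 50, 200),
]
ordered = two_column_order(detected, page_width=612)
print([b.text for b in ordered])
```

Real pages with headers, footers, or tables spanning both columns would need extra handling that this sketch does not attempt.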

Please provide support on these issues.

MthwRobinson commented 8 months ago

@lavish2210 - Thanks for reporting this. We're currently doing data annotation to improve our partitioning models and will include this in the data set.

lavish2210 commented 8 months ago

It would be great if you could share the timeline by which all the above-listed issues will be solved.

MthwRobinson commented 8 months ago

We'll post timelines on model-related updates in our Slack channel.