Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.49k stars 584 forks

bug/Failure to recognize footer and page number, incorrectly classifies as Narrative text #3167

Closed tanzeel291994 closed 3 weeks ago

tanzeel291994 commented 3 weeks ago

Describe the bug: Using partition_pdf locally with the "yolox" model fails to classify the footer section.

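To Reproduce

    from unstructured.partition.pdf import partition_pdf
    from unstructured.staging.base import elements_to_json

    pdf_elements = partition_pdf(filepath, infer_table_structure=True, strategy="hi_res", hi_res_model_name="yolox_quantized")
    json = elements_to_json(pdf_elements)
    print(json)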

Expected behavior: According to the documentation, the footer should be identified as "Footer": https://docs.unstructured.io/open-source/concepts/document-elements
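
One way to confirm which categories were actually assigned is to tally the "type" field in the JSON output. The following is a minimal sketch, assuming the output is a JSON list of objects that each carry "type" and "text" keys; the helper names (tally_types, texts_of_type) and the sample data are hypothetical and are not taken from the reporter's document.

```python
import json
from collections import Counter


def tally_types(elements_json: str) -> Counter:
    """Count how many elements were assigned to each category."""
    elements = json.loads(elements_json)
    return Counter(el.get("type", "Unknown") for el in elements)


def texts_of_type(elements_json: str, type_name: str) -> list[str]:
    """Return the text of every element labeled with the given category."""
    elements = json.loads(elements_json)
    return [el.get("text", "") for el in elements if el.get("type") == type_name]


# Hypothetical sample standing in for the serialized partition output.
sample = json.dumps([
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew in the third quarter."},
    {"type": "NarrativeText", "text": "Page 3 of 12"},
])

counts = tally_types(sample)
print(counts)
print("Footer elements:", counts["Footer"])
print(texts_of_type(sample, "NarrativeText"))
```

If the count for "Footer" is zero while footer-like strings show up under "NarrativeText", the output matches the misclassification described in this report.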

Screenshots: [image] [image]

MthwRobinson commented 3 weeks ago

Hi @tanzeel291994 - thanks for reporting. If you need more accurate element classification results, consider using our API. The document understanding model available through the API is more accurate and incremental improvements to the model will be deployed there.

cc @leah1985