Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat: enhance analysis options with od model dump and better vis #3234

Closed pawel-kmiecik closed 5 days ago

pawel-kmiecik commented 1 week ago

This PR adds the ability to draw bounding boxes (bboxes) for each layout (extracted, inferred, OCR, and final), and dumps the object detection (OD) model output to a JSON file for easier analysis.
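To make the "OD model output dump as a JSON file" idea concrete, here is a minimal, standard-library-only sketch. It is not the code from this PR: `LayoutBox`, `dump_layout_to_json`, and the JSON shape (a list of pages, each with an `elements` list) are hypothetical names and choices made up for illustration, and the real dump format in `unstructured` may differ.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class LayoutBox:
    """Hypothetical record for one detected region (not a library type)."""

    page_number: int
    label: str
    x1: float
    y1: float
    x2: float
    y2: float
    probability: Optional[float] = None


def dump_layout_to_json(boxes: List[LayoutBox], path: str) -> list:
    """Group boxes by page, write them to `path` as JSON, and return the payload."""
    pages = {}
    for box in boxes:
        record = asdict(box)
        # Use the page number as the grouping key rather than repeating it per box.
        page = record.pop("page_number")
        pages.setdefault(page, []).append(record)

    # Emit pages in ascending order so dumps are easy to diff between runs.
    payload = [
        {"page_number": page, "elements": elements}
        for page, elements in sorted(pages.items())
    ]
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload
```

Grouping by page keeps the dump aligned with per-page bbox images such as those named in the comments below, so one page's detections can be compared against its rendering.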

sentry-io[bot] commented 1 week ago

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: unstructured/partition/pdf.py

Function: _partition_pdf_or_image_local
Unhandled Issue: IndexError: list index out of range /general/v0/g...
Event Count: 12

pawel-kmiecik commented 1 week ago

Drawing bboxes for OCR layout doesn't appear to be working.

PDF: 2023_SustainabilityReport_33.pdf

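from unstructured.partition.pdf import partition_pdf

strategy = "hi_res"  # assumption: the comment does not say which strategy was used

elements = partition_pdf(
    filename="2023_SustainabilityReport_33.pdf",
    strategy=strategy,
    analysis=True,  # enable the analysis output (bbox drawings, OD model dump)
)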

Results:

page1_layout_ocr

page1_layout_od_model

page1_layout_pdfminer

page1_layout_final

This should be fixed now: page1_layout_ocr