Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.51k stars, 590 forks

bug/bounding boxes using strategy="hi_res" are wrong #3100

Open mandar-karhade opened 1 month ago

mandar-karhade commented 1 month ago

Describe the bug

The element coordinates used to draw bounding boxes differ between the default strategy and the "hi_res" strategy for the same PDF.

To Reproduce

sudo apt-get install -y poppler-utils  tesseract-ocr
pip install "unstructured[pdf]==0.12.5" PyMuPDF poppler-utils unstructured_inference==0.7.23 
# unstructured_inference 0.7.24 has a compatibility issue (Image.open()) with unstructured 0.12.5, so it is downgraded to 0.7.23

# Partition the PDF into chunks
import fitz
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Element

# Define the inputs before partition_pdf uses them
document = "/content/1706.03762v7.pdf"
chunk_size = 0

# Partition with the hi_res strategy
elements_high_res = partition_pdf(
    filename=document,
    chunk_size=chunk_size,
    extract_images_in_pdf=True,
    extract_image_block_output_dir="/content/images",
    strategy="hi_res",
    use_gpu=True,
)

# Partition with the default strategy
elements = partition_pdf(
    filename=document,
    chunk_size=chunk_size,
)

# Using hi_res strategy
output_pdf_path = "/content/1706.03762v7_modded_high_res.pdf"
chunk_size = 0 
pdf_document = fitz.open(document)

for element in elements_high_res:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Skip elements that have no page number or no coordinates
        if page_number is not None and coordinates is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2

# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()

# Using default strategy
output_pdf_path = "/content/1706.03762v7_modded.pdf"
chunk_size = 0 
pdf_document = fitz.open(document)

for element in elements:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Skip elements that have no page number or no coordinates
        if page_number is not None and coordinates is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2

# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
[1706.03762v7_modded_high_res.pdf](https://github.com/Unstructured-IO/unstructured/files/15441444/1706.03762v7_modded_high_res.pdf)
[1706.03762v7_modded.pdf](https://github.com/Unstructured-IO/unstructured/files/15441445/1706.03762v7_modded.pdf)
[1712.05889v2.pdf](https://github.com/Unstructured-IO/unstructured/files/15441446/1712.05889v2.pdf)
[1706.03762v7.pdf](https://github.com/Unstructured-IO/unstructured/files/15441447/1706.03762v7.pdf)

Expected behavior

The bounding boxes should not change when the strategy changes.

Screenshots

The screenshots are attached as PDFs, but here they are inline as well:

Default strategy: [image: default_strategy]

hi_res strategy: [image: hi_res_strategy]

Environment Info

Public workbook link: https://colab.research.google.com/drive/1z2dwE9t6zsgTcejx9RQzj_nTDHOdS4Vv?usp=sharing

Additional context

None

MthwRobinson commented 1 month ago

@leah1985 - Does this seem like an issue with the model output or a pre/post-processing issue?

christinestraub commented 2 weeks ago

@MthwRobinson I think this is not a "hi_res" strategy issue but a "fast" strategy issue due to CoordinateSystem. I'll take a closer look at this issue.
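
If the mismatch does come from the coordinate system, one way to compare the two outputs is to rescale every box into the PDF page's own point space before drawing it. The following is only a minimal sketch, not the library's conversion logic. It assumes the dictionary returned by `coordinates.to_dict()` carries `layout_width` and `layout_height` keys next to `points`, and that both spaces have their origin at the top-left corner. The helper name `rescale_points` and the sample numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def rescale_points(coords: Dict, page_width: float, page_height: float) -> List[Point]:
    """Scale points from their layout space into the page's point space.

    Assumes `coords` has "points", "layout_width" and "layout_height" keys
    and that both spaces use a top-left origin (an assumption, not verified
    against the library).
    """
    scale_x = page_width / coords["layout_width"]
    scale_y = page_height / coords["layout_height"]
    return [(x * scale_x, y * scale_y) for x, y in coords["points"]]


# Hypothetical box on a 1700x2200 pixel render of a 612x792 point page
example = {
    "points": [(170.0, 220.0), (170.0, 440.0), (850.0, 440.0), (850.0, 220.0)],
    "layout_width": 1700,
    "layout_height": 2200,
}

scaled = rescale_points(example, page_width=612.0, page_height=792.0)
top_left, bottom_right = scaled[0], scaled[2]
print(top_left, bottom_right)
```

In the drawing loops from the report, `page.rect.width` and `page.rect.height` from PyMuPDF could supply the page size, and the rescaled `top_left` and `bottom_right` could then be passed to `fitz.Rect`. If a coordinate system uses a bottom-left origin, the y values would also need to be flipped, which this sketch does not do.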

MthwRobinson commented 1 week ago

Sounds good - thanks!