Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.4k stars 573 forks

feat/bbox_scaling_parameter #3235

Open LesykDev opened 1 week ago

LesykDev commented 1 week ago

Is your feature request related to a problem? Please describe. Sometimes when partitioning a PDF with plots and tables, the plot title is cropped off by the bounding box, so important context is lost for the summarization LLM.

Describe the solution you'd like Add a bbox scale parameter to the partition function to increase/decrease the bounding box size.

Describe alternatives you've considered Don't know of any alternatives, other than maybe changing the detection model.

Additional context The change should be made in unstructured\partition\pdf_image\pdf_image_utils.py at line 183 (version 0.14.6). Example code that should fix the issue:

offset = 0.18  # Should be a parameter
padded_bbox = cast(
    Tuple[int, int, int, int],
    pad_bbox(
        (x1 * (1 - offset), y1 * (1 - offset), x2 * (1 + offset), y2 * (1 + offset)),
        (h_padding, v_padding),
    ),
)
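
One caveat with multiplying the raw coordinates: the box is scaled about the page origin, so a box near the top-left corner barely grows on its top and left sides, while a box far from the origin grows by a large absolute amount. A minimal sketch of an alternative, assuming the goal is to grow the box by a fraction of its own width and height on every side, is below. The name `scale_bbox` and its parameters are hypothetical and not part of unstructured; the result would still need to go through `pad_bbox` as in the snippet above.

```python
from typing import Tuple

BBox = Tuple[int, int, int, int]


def scale_bbox(bbox: BBox, offset: float, image_size: Tuple[int, int]) -> BBox:
    """Grow (or shrink, for a negative offset) a box relative to its own size.

    Hypothetical helper, not an unstructured API.
    bbox is (x1, y1, x2, y2); image_size is (width, height).
    offset is the fraction of the box width/height added on each side.
    """
    x1, y1, x2, y2 = bbox
    width, height = image_size

    # Growth is proportional to the box itself, not to its distance from (0, 0).
    dx = (x2 - x1) * offset
    dy = (y2 - y1) * offset

    # Clamp to the image so the crop never leaves the page.
    return (
        max(0, round(x1 - dx)),
        max(0, round(y1 - dy)),
        min(width, round(x2 + dx)),
        min(height, round(y2 + dy)),
    )
```

With this definition, `scale_bbox((100, 200, 300, 400), 0.25, (1000, 1000))` widens a 200x200 box by 50 pixels on every side, regardless of where the box sits on the page.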

christinestraub commented 1 week ago

Hi @LesykDev, can you please provide a PDF document illustrating this issue?

LesykDev commented 1 week ago

Sure, @christinestraub: https://liftoff.energy.gov/wp-content/uploads/2023/05/20230523-Pathways-to-Commercial-Liftoff-Clean-Hydrogen.pdf

The issue is clearly visible in the graphs on pages 9, 18, and 96. I'm using Python 3.11.0 and unstructured 0.12.2.

MthwRobinson commented 1 hour ago

@LesykDev - thanks for reporting this issue. If you'd like to open a PR with your suggested fix, we'd be happy to review. Otherwise we'll pick this up as soon as we're able.