Open LesykDev opened 1 week ago
Hi @LesykDev, Can you please provide a PDF document illustrating this issue?
Sure, @christinestraub — https://liftoff.energy.gov/wp-content/uploads/2023/05/20230523-Pathways-to-Commercial-Liftoff-Clean-Hydrogen.pdf
It is clearly visible in the graphs on pages 9, 18 and 96. I'm on Python 3.11.0 with unstructured 0.12.2.
@LesykDev - thanks for reporting this issue. If you'd like to open a PR with your suggested fix, we'd be happy to review. Otherwise we'll pick this up as soon as we're able.
Is your feature request related to a problem? Please describe. Sometimes when partitioning a PDF with plots and tables, the plot title gets cropped out of the bounding box, so important context is lost for the summarization LLM.
Describe the solution you'd like Add a bbox scale parameter to the partition function to increase/decrease the bounding box size.
Describe alternatives you've considered Don't know of any alternatives, other than maybe changing the detection model.
Additional context The change should be done in file:
unstructured\partition\pdf_image\pdf_image_utils.py
at line 183 (ver. 0.14.6). Example of the code that should fix the issue:
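A minimal sketch of the idea, not the actual code from `pdf_image_utils.py`: the helper name `scale_bbox`, its `scale` parameter, and the `(x1, y1, x2, y2)` pixel layout are all assumptions made for illustration. It scales the box about its center and clamps the result to the image bounds, so a value above 1 pulls in nearby text such as a plot title.

```python
from typing import Tuple

# Hypothetical box layout: (x1, y1, x2, y2) in image pixels.
BBox = Tuple[float, float, float, float]


def scale_bbox(bbox: BBox, scale: float, image_size: Tuple[int, int]) -> BBox:
    """Grow (scale > 1) or shrink (scale < 1) a box about its center.

    The result is clamped to the image so the crop never leaves the page.
    `image_size` is (width, height).
    """
    if scale <= 0:
        raise ValueError("scale must be positive")

    x1, y1, x2, y2 = bbox
    width, height = image_size

    # Center of the original box.
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2

    # Half extents after scaling.
    half_w = (x2 - x1) * scale / 2
    half_h = (y2 - y1) * scale / 2

    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(width), cx + half_w),
        min(float(height), cy + half_h),
    )


# A 200x100 box scaled by 1.5 becomes 300x150 around the same center.
padded = scale_bbox((100, 100, 300, 200), scale=1.5, image_size=(1000, 1000))
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it could be passed to the crop call directly. Scaling about the center is only one option; padding just the top edge might suit the cropped-title case better.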