Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.4k stars, 573 forks

fix: fix `IndexError` when partitioning a PDF with `starting_page_number` #3246

Closed: awalker4 closed this 1 week ago

awalker4 commented 1 week ago

The Issue:

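When extracting images from PDFs, we use the page number in each element's metadata to index into the list of page images. However, that page number can now be shifted via `starting_page_number`, so it no longer matches the position in the list and can run past the end of it. To get the true page index, we need to subtract `starting_page_number` from the metadata page number.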
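To make the off-by-offset concrete, here is a minimal sketch of the index arithmetic. It is not the library's code: `page_index` and `image_paths` are hypothetical names, and it assumes the first page is reported as `starting_page_number` (with a default of 1) and that the image list is 0-based.

```python
def page_index(metadata_page_number: int, starting_page_number: int = 1) -> int:
    """Map a reported page number back to a 0-based index into the page-image list.

    Hypothetical helper for illustration; assumes the first page of the
    document is reported as `starting_page_number`.
    """
    return metadata_page_number - starting_page_number


# Hypothetical list with one rendered image per page of a 3-page PDF.
image_paths = ["page-1.png", "page-2.png", "page-3.png"]
starting_page_number = 20

# With the offset applied, the pages are reported as 20, 21, and 22.
reported_page_number = 22

# Treating the reported number as a plain 1-based page number overruns the list.
try:
    image_paths[reported_page_number - 1]
    naive_lookup_failed = False
except IndexError:
    naive_lookup_failed = True

# Subtracting the offset recovers the position of the last page's image.
last_image = image_paths[page_index(reported_page_number, starting_page_number)]
```

With the default `starting_page_number=1`, the subtraction reduces to the usual `page_number - 1`, so documents partitioned without an offset behave as before.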

Testing:

Run this snippet in a Python shell. Before the fix, it raises an `IndexError`. On this branch, it returns the elements.

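from unstructured.partition.auto import partition

# Sample PDF under example-docs/.
filename = "example-docs/layout-parser-paper-with-table.pdf"
# Extract image blocks while offsetting the reported page numbers to start at 20.
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)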