Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.4k stars, 573 forks

feat/Add page range to partition functions #3231

Open ChiNoel-osu opened 1 week ago

ChiNoel-osu commented 1 week ago

Is your feature request related to a problem? Please describe.
We can always filter out the pages we don't need after partitioning. But if we only need a few pages out of a hundred-page PDF, an option to restrict the page range would save time, since the remaining pages would not have to be processed at all. The saving matters most with the hi_res strategy.

Describe the solution you'd like
Have the partition functions accept page range parameters such as start_at_page and end_at_page. Alternatively, accept a list of page numbers, which would give people more flexibility.
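
To make the two proposed forms concrete, here is a minimal sketch of how a range and an explicit page list could be normalized into one selection before any page is processed. The parameter names start_at_page and end_at_page come from the request above; the helper resolve_pages, the pages and total_pages parameters, the 1-based numbering, and the validation rules are all assumptions for illustration and are not part of the unstructured API.

```python
from typing import Iterable, List, Optional


def resolve_pages(
    total_pages: int,
    start_at_page: Optional[int] = None,
    end_at_page: Optional[int] = None,
    pages: Optional[Iterable[int]] = None,
) -> List[int]:
    """Return the sorted, de-duplicated 1-based page numbers to process.

    Hypothetical helper for illustration only; not part of unstructured.
    """
    if pages is not None:
        # An explicit list and a range are treated as mutually exclusive (assumption).
        if start_at_page is not None or end_at_page is not None:
            raise ValueError("pass either a page list or a range, not both")
        selected = sorted(set(pages))
    else:
        # Missing bounds default to the first and last page of the document.
        first = 1 if start_at_page is None else start_at_page
        last = total_pages if end_at_page is None else end_at_page
        if first > last:
            raise ValueError("start_at_page must not exceed end_at_page")
        selected = list(range(first, last + 1))

    # Reject anything outside the document instead of silently dropping it.
    out_of_bounds = [p for p in selected if p < 1 or p > total_pages]
    if out_of_bounds:
        raise ValueError(f"pages out of range 1..{total_pages}: {out_of_bounds}")
    return selected
```

With a selection like this, a partition function would only need to hand the chosen pages to the expensive per-page work, for example layout detection under hi_res, and skip everything else.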

Describe alternatives you've considered
Nope.

Additional context
Nope.