Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.37k stars, 572 forks

feat/extract_pdf_page_images #3299

Open huanji1987 opened 3 days ago

huanji1987 commented 3 days ago

Is your feature request related to a problem? Please describe. I'm working on a project that uses partition_pdf with the hi_res strategy. The project also needs each page of the PDF extracted as an image. In the code path that partition_pdf with hi_res eventually hits, the per-page images are already being rendered with pdf2image. Instead of rendering the page images a second time, it would be ideal to reuse these temporary images, which are currently discarded when the with block exits.

Describe the solution you'd like Ideally the partition_pdf function would accept an option such as extract_pdf_page_images. When this option is True, instead of writing the images to a directory created with tempfile.TemporaryDirectory() and discarding them, the images would be returned in the response in some way so they are available for use.
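To make the request concrete, here is a minimal sketch of the behaviour I have in mind. It is not the unstructured code: process_pdf, page_image_dir, and the render argument are hypothetical names used only for illustration, and the layout-analysis step is elided. The only real library call is pdf2image's convert_from_path with output_folder and paths_only.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def _render_with_pdf2image(pdf_path: str, output_folder: str) -> List[str]:
    # Imported lazily so the rest of the sketch works without pdf2image installed.
    from pdf2image import convert_from_path

    # paths_only=True returns the file paths written to output_folder
    # instead of loaded PIL images.
    return convert_from_path(
        pdf_path, output_folder=output_folder, paths_only=True, fmt="png"
    )


def process_pdf(
    pdf_path: str,
    page_image_dir: Optional[str] = None,  # hypothetical option
    render: Callable[[str, str], List[str]] = _render_with_pdf2image,
) -> List[str]:
    """Hypothetical stand-in for the hi_res path.

    Renders pages into a temporary directory (as happens today) and, if
    page_image_dir is given, copies them out before the directory is deleted.
    Returns the paths of the kept page images (empty if none were kept).
    """
    kept: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        image_paths = render(pdf_path, tmp)

        # ... layout detection / OCR would run on image_paths here ...

        if page_image_dir is not None:
            dest = Path(page_image_dir)
            dest.mkdir(parents=True, exist_ok=True)
            for path in image_paths:
                target = dest / Path(path).name
                shutil.copy2(path, target)
                kept.append(str(target))
    # The temporary directory is gone here; only the copies in dest remain.
    return kept
```

With something like this, a caller would pass a directory once and get both the partition result and the page images from a single pdf2image run. Whether the images should be returned as paths, as PIL images, or attached to element metadata is an open question.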

Describe alternatives you've considered Alternatively, I could do one of the following:

  1. Do this separately and just eat the double work. Unfortunately, pdf2image can be quite slow.
  2. Monkeypatch the code. This is a good temporary fix, but it would likely require pinning the version of unstructured used and is not a viable long-term strategy.
  3. Fork the unstructured code and implement a fix for my own use. Like monkeypatching, this is not a viable long-term strategy.
  4. Use a library other than pdf2image that is fast enough that the double work is no big deal. This has been explored and is not viable for various reasons.

Additional context I want to add that I am happy to create a pull request myself for this feature. Mostly I'm curious what people think of the idea, and of the right approach to take if I were to create a PR.