Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/PIL.UnidentifiedImageError: cannot identify image file #3102

Open udit-pandey-1 opened 1 month ago

udit-pandey-1 commented 1 month ago

Describe the bug: I am getting the following error when extracting text and images from a PDF: PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpjy0tjjjd/2c2e244f-8f8e-46de-a7bc-2ecfbaa254ea-566.ppm' [image]

To Reproduce: This is how I am using unstructured: [image]

Expected behavior: Ideally, all the images in the PDF should be extracted. If extraction does fail, it should not abort the whole document (in my case, the PDF has 800 pages and it fails after going through 600 pages). For layouts where image extraction failed, the metadata could carry a flag indicating the failure along with the reason. A flag passed when calling partition() should let us get the elements even when such failures occur.

Environment Info: [image]

Any kind of quickfix to get elements even in case of failure would also be appreciated.
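As a rough illustration of the requested behavior (not something the library provides, as far as this thread shows), a caller-side wrapper can process a document in independent pieces and record failures instead of aborting. The sketch below assumes the PDF has already been split into smaller files; `partition_tolerantly`, `TolerantResult`, and `partition_fn` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class TolerantResult:
    """Elements that were extracted, plus a record of every chunk that failed."""
    elements: List[Any] = field(default_factory=list)
    failures: List[Tuple[str, str]] = field(default_factory=list)  # (chunk, reason)


def partition_tolerantly(
    chunks: Iterable[str],
    partition_fn: Callable[[str], List[Any]],
) -> TolerantResult:
    """Call partition_fn on each chunk; keep going when one chunk raises.

    partition_fn is a placeholder for whatever does the real work, for
    example a small function wrapping unstructured's partition(filename=...).
    """
    result = TolerantResult()
    for chunk in chunks:
        try:
            result.elements.extend(partition_fn(chunk))
        except Exception as exc:  # deliberately broad: any failure is recorded
            result.failures.append((chunk, f"{type(exc).__name__}: {exc}"))
    return result
```

With this shape, a failure such as the PIL.UnidentifiedImageError above would cost only the chunk it occurred in, and `result.failures` would say which chunk and why. Splitting the PDF into chunks is left out of the sketch.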

MthwRobinson commented 1 month ago

Hi @udit-pandey-1 - could you provide a URL that we could use to reproduce? I'd also give our SaaS API a try. Our unstructured-python-client library will split the PDF up and distribute it across multiple workers, which should give you faster processing times.

vegetableman commented 1 month ago

Hi @MthwRobinson, I got the above error in this file: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf Appreciate your efforts.

christinestraub commented 1 month ago

Hi @vegetableman, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.

$ pip install unstructured -U
$ pip install unstructured-inference -U

from unstructured.partition.auto import partition

elements = partition(
    url="https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)

print("\n\n".join([str(el) for el in elements]))

vegetableman commented 1 month ago

The latest versions worked for me :+1:... I was using the specific versions mentioned here: https://github.com/Unstructured-IO/unstructured/issues/2566#issuecomment-1982063333 Thank you, Christine!

However, unless I am mistaken, partition_pdf does not support loading PDF files through a url parameter. I had to use the filename parameter instead.

christinestraub commented 1 month ago

Yes, as of now, partition_pdf does not support loading pdf files through a url parameter. Do we plan to do this? @MthwRobinson

MthwRobinson commented 1 month ago

We don't plan to add that in partition_pdf as of now, though I believe that works in partition and will detect the MIME type from the HTTP response.

vegetableman commented 1 month ago

@MthwRobinson that worked :+1:. My bad, I missed the auto module. Thank you!
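For anyone who wants to stay with partition_pdf and its filename parameter, one option is to download the PDF to a temporary file first. This is a minimal sketch using only the standard library; `download_to_tempfile` and its `opener` argument are hypothetical names introduced here, not part of unstructured.

```python
import shutil
import tempfile
import urllib.request
from typing import Callable


def download_to_tempfile(
    url: str,
    opener: Callable = urllib.request.urlopen,
    suffix: str = ".pdf",
) -> str:
    """Stream the response body for url into a temp file and return its path.

    opener is injectable so the function can be exercised without network
    access; by default it is urllib.request.urlopen. The caller is
    responsible for deleting the file afterwards.
    """
    with opener(url) as response, tempfile.NamedTemporaryFile(
        delete=False, suffix=suffix
    ) as tmp:
        shutil.copyfileobj(response, tmp)
        return tmp.name
```

The returned path could then be passed along as `partition_pdf(filename=path, ...)`. As noted above, `partition` with `url=` already covers this case, so the helper only matters when calling partition_pdf directly.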

udit-pandey-1 commented 1 month ago

@christinestraub the issue is still occurring for me after upgrading the mentioned packages.

We are seeing this issue on Ubuntu 20.04.

udit-pandey-1 commented 1 month ago

Here is a reference PDF file for it: https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf

christinestraub commented 2 weeks ago

@udit-pandey-1, I tried to partition the reference PDF file on both macOS and Ubuntu (22.04). It worked as expected and I couldn't reproduce the error. Can you please try again?

Environment:

unstructured==0.14.6
unstructured-inference==0.7.35

Code:

from unstructured.partition.auto import partition

elements = partition(
    url="https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)

print("\n\n".join([str(el) for el in elements]))

udit-pandey-1 commented 1 week ago

still the same @christinestraub

unstructured==0.14.6
unstructured-inference==0.7.36
[image]

christinestraub commented 1 week ago

@udit-pandey-1 I was wondering if you are sure that you installed the following system dependencies?

udit-pandey-1 commented 1 week ago

libmagic-dev wasn't there. I installed it and then used the same code as above. It still failed with the same error.
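A quick way to double-check the system side is to see which command-line tools are visible on PATH. The sketch below assumes, without confirmation from this thread, that the .ppm temp file comes from Poppler's pdftoppm and that tesseract is used for OCR; the tool names and the `locate_tools` helper are illustrative choices, and a library-only package such as libmagic-dev will not show up this way.

```python
import shutil
from typing import Dict, Iterable, Optional


def locate_tools(
    names: Iterable[str] = ("pdftoppm", "tesseract"),
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

A missing entry only shows that the tool is not visible to the current Python process. A found one says nothing about whether its version is compatible.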