Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/PIL.UnidentifiedImageError: cannot identify image file #3102

Open udit-pandey-1 opened 5 months ago

udit-pandey-1 commented 5 months ago

Describe the bug I am getting the following error when extracting text and images from a PDF: PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpjy0tjjjd/2c2e244f-8f8e-46de-a7bc-2ecfbaa254ea-566.ppm'

To Reproduce The way I am using unstructured is: image

Expected behavior Ideally, all the images in the PDF should be extracted. If a failure does occur, image extraction should not abort for the whole document (in my case, the PDF has 800 pages and it fails after going through 600 pages). For the layouts where image extraction failed, a flag in the metadata could indicate that it failed and give the reason. We should be able to get elements even in case of failures, via a flag passed when calling partition().

Environment Info image

Any kind of quickfix to get elements even in case of failure would also be appreciated.
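
One possible quickfix, as a minimal sketch: partition the document one page (or page range) at a time and record failures instead of letting a single bad page abort the whole run. `partition_page` below is a hypothetical callable you would supply yourself (for example, one that splits out a single page and passes it to partition()); it is not part of unstructured, and neither is `partition_pages_tolerantly`.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def partition_pages_tolerantly(
    page_numbers: Iterable[int],
    partition_page: Callable[[int], List[Any]],
) -> Tuple[List[Any], Dict[int, str]]:
    """Collect elements page by page, recording failures instead of raising.

    `partition_page` is a caller-supplied (hypothetical) function that returns
    the elements for one page number.
    """
    elements: List[Any] = []
    failures: Dict[int, str] = {}
    for page_number in page_numbers:
        try:
            elements.extend(partition_page(page_number))
        except Exception as exc:  # deliberately broad: any page-level failure is recorded
            failures[page_number] = f"{type(exc).__name__}: {exc}"
    return elements, failures
```

The returned `failures` dict maps each failed page number to the exception text, which is roughly the "flag plus reason" asked for above, just kept outside the element metadata.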

MthwRobinson commented 5 months ago

Hi @udit-pandey-1 - could you provide a URL that we could use to reproduce? I'd also give our SaaS API a try. Our unstructured-python-client library will split the PDF up and distribute it across multiple workers, which should give you faster processing times.

vegetableman commented 5 months ago

Hi @MthwRobinson, I got the above error in this file: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf Appreciate your efforts.

christinestraub commented 5 months ago

Hi @vegetableman, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.

$ pip install unstructured -U
$ pip install unstructured-inference -U

from unstructured.partition.auto import partition

elements = partition(
    url="https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)
print("\n\n".join([str(el) for el in elements]))

vegetableman commented 5 months ago

The latest versions worked for me :+1:... I was using the specific versions mentioned here: https://github.com/Unstructured-IO/unstructured/issues/2566#issuecomment-1982063333 Thank you, Christine!
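
Since the fix here came down to package versions, a small sketch for reporting what is actually installed in the running environment may help when comparing setups. It uses only the standard library; the package list is just the set of names that come up in this thread, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["unstructured", "unstructured-inference", "pdf2image", "pillow"]
    for name, ver in installed_versions(names).items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```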

However, unless I am mistaken, partition_pdf does not support loading PDF files through a url parameter. I had to use the filename parameter instead.
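
For anyone else calling partition_pdf directly, a rough sketch of the workaround: download the file to a temporary path first and pass that path as filename. `download_to_tempfile` is a hypothetical helper written for this comment, not something unstructured provides.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_to_tempfile(url: str, suffix: str = ".pdf") -> Path:
    """Stream the contents of `url` into a temporary file and return its path.

    The file is not deleted automatically; remove it once partitioning is done.
    """
    with urllib.request.urlopen(url) as response, tempfile.NamedTemporaryFile(
        suffix=suffix, delete=False
    ) as tmp:
        shutil.copyfileobj(response, tmp)
    return Path(tmp.name)


# Usage (assumes unstructured is installed):
#   from unstructured.partition.pdf import partition_pdf
#   path = download_to_tempfile("https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf")
#   elements = partition_pdf(filename=str(path))
```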

christinestraub commented 5 months ago

Yes, as of now, partition_pdf does not support loading pdf files through a url parameter. Do we plan to do this? @MthwRobinson

MthwRobinson commented 5 months ago

We don't plan to add that to partition_pdf as of now, though I believe it works in partition, which will detect the MIME type from the HTTP response.
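
To illustrate the idea of picking a file type from the HTTP response, here is a standard-library sketch that reduces a Content-Type header to its bare MIME type. This is not the code partition uses internally, only an example of the general technique; `mime_type_from_header` is a hypothetical name.

```python
from email.message import Message


def mime_type_from_header(content_type: str) -> str:
    """Reduce a Content-Type header value to a lowercase 'type/subtype' string.

    Parameters such as '; charset=...' are dropped.
    """
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_type()


def looks_like_pdf(content_type: str) -> bool:
    """True when the header value names the PDF MIME type."""
    return mime_type_from_header(content_type) == "application/pdf"
```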

vegetableman commented 5 months ago

@MthwRobinson that worked :+1:. My bad, I missed the auto module. Thank you!

udit-pandey-1 commented 5 months ago

@christinestraub the issue is still occurring for me after upgrading the mentioned packages.

We are seeing this issue on Ubuntu 20.04.

udit-pandey-1 commented 5 months ago

here is a reference pdf file for it: https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf

christinestraub commented 4 months ago

@udit-pandey-1, I tried to partition the reference PDF file on both macOS and Ubuntu (22.04). It worked as expected and I couldn't reproduce the error. Can you please try again?

Environment:

unstructured==0.14.6
unstructured-inference==0.7.35

Code:

from unstructured.partition.auto import partition

elements = partition(
    url="https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)

print("\n\n".join([str(el) for el in elements]))

udit-pandey-1 commented 4 months ago

still the same @christinestraub

unstructured==0.14.6
unstructured-inference==0.7.36

christinestraub commented 4 months ago

@udit-pandey-1 I was wondering if you are sure that you installed the following system dependencies?

udit-pandey-1 commented 4 months ago

libmagic-dev wasn't there. I installed it and then used the same code as above. It still failed with the same error.
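
A quick way to see which command-line tools are visible from the Python process, as a sketch. The tool names are an assumption about what this pipeline shells out to: pdftoppm and pdftocairo ship with poppler (which pdf2image calls), and tesseract is the usual OCR binary. `find_tools` is a hypothetical helper.

```python
import shutil
from typing import Callable, Dict, Iterable, Optional


def find_tools(
    names: Iterable[str] = ("pdftoppm", "pdftocairo", "tesseract"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path on PATH, or None if not found."""
    return {name: which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

A tool that resolves to a path can still be an old or broken build, so this only rules out the "not installed at all" case.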

sanyamjain0315 commented 3 months ago

Has there been any progress on this issue? I am facing the same problem, even after having tried everything.

tpakeman commented 2 months ago

Hi there, I'm having the same issue with Python 3.10.12:

unstructured                     0.14.6
unstructured-client              0.25.6
unstructured-inference           0.7.35
unstructured.pytesseract         0.3.13

Unfortunately I can't share the documents as they contain proprietary information.

This is happening for every PDF in a folder of 50. All were generated from HTML files by downloading with Chrome and saving as PDF.

Stacktrace:

---------------------------------------------------------------------------
UnidentifiedImageError                    Traceback (most recent call last)
<ipython-input-21-a26b75af5795> in <cell line: 4>()
      4 for k in data.keys():
      5   fpath = f"/path/to/file/{k}"
----> 6   els = partition_pdf(filename=fpath, 
      7                       max_partition=1500,
      8                       chunking_strategy='by_title',

10 frames
/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py in wrapper(*args, **kwargs)
    603             unique_element_ids: bool = call_args.get("unique_element_ids", False)
    604             if unique_element_ids is False:
--> 605                 elements = assign_and_map_hash_ids(elements)
    606 
    607             return elements

/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/chunking/dispatch.py in wrapper(*args, **kwargs)
     72 
     73         # -- call the partitioning function to get the elements --
---> 74         elements = func(*args, **kwargs)
     75 
     76         # -- look for a chunking-strategy argument --

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    208         form_extraction_skip_tables=form_extraction_skip_tables,
    209         **kwargs,
--> 210     )
    211 
    212 

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    344     if isinstance(file, bytes):
    345         file = io.BytesIO(file)
--> 346     return _partition_pdf_with_pdfminer(
    347         filename=filename,
    348         file=file,

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in _partition_pdf_or_image_with_ocr(filename, file, include_page_breaks, languages, ocr_languages, is_image, metadata_last_modified, starting_page_number, **kwargs)
    894             tmp_element = element
    895             tmp_text = element.text
--> 896             tmp_coords = element.metadata.coordinates
    897         elif tmp_element and check_coords_within_boundary(
    898             coordinates=element.metadata.coordinates,

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf_image/pdf_image_utils.py in convert_pdf_to_images(filename, file, chunk_size)
    414     date_from_file_object: bool = False,
    415 ) -> str | None:
--> 416     last_modification_date = None
    417     if not file and filename:
    418         last_modification_date = get_last_modified_date(filename=filename)

/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    267                 )
    268             else:
--> 269                 images += parse_buffer_func(data)
    270     finally:
    271         if auto_temp_dir:

/usr/local/lib/python3.10/dist-packages/pdf2image/parsers.py in parse_buffer_to_ppm(data)
     26         size_x, size_y = tuple(size.split(b" "))
     27         file_size = len(code) + len(size) + len(rgb) + 3 + int(size_x) * int(size_y) * 3
---> 28         images.append(Image.open(BytesIO(data[index : index + file_size])))
     29         index += file_size
     30 

/usr/local/lib/python3.10/dist-packages/PIL/Image.py in open(fp, mode, formats)
   3281             raise TypeError(msg) from e
   3282     else:
-> 3283         rawmode = mode
   3284     if mode in ["1", "L", "I", "P", "F"]:
   3285         ndmax = 2

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7e086492d030>
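
The pdf2image frame above shows parse_buffer_to_ppm computing each image's size from its PPM header and slicing that many bytes out of the buffer before handing them to PIL. A rough sketch for checking whether such a buffer is truncated or malformed, reusing the same size arithmetic: it assumes binary P6 images with 8-bit samples, and `inspect_ppm_stream` is a hypothetical helper, not part of pdf2image. It only narrows down where the data goes wrong; it does not establish the root cause.

```python
from typing import Any, Dict, List


def inspect_ppm_stream(data: bytes) -> List[Dict[str, Any]]:
    """Walk concatenated binary PPM (P6) images and flag incomplete ones."""
    report: List[Dict[str, Any]] = []
    index = 0
    while index < len(data):
        # The header is three newline-terminated fields: magic, "W H", maxval.
        header = data[index : index + 40].split(b"\n")[0:3]
        if len(header) < 3 or header[0] != b"P6":
            report.append({"offset": index, "ok": False, "reason": "missing or unexpected header"})
            break
        code, size, maxval = header
        try:
            width, height = (int(value) for value in size.split(b" "))
        except ValueError:
            report.append({"offset": index, "ok": False, "reason": "unparseable size field"})
            break
        # Header bytes plus three newlines plus 3 bytes (RGB) per pixel.
        expected = len(code) + len(size) + len(maxval) + 3 + width * height * 3
        available = len(data) - index
        report.append(
            {
                "offset": index,
                "expected": expected,
                "available": min(expected, available),
                "ok": available >= expected,
            }
        )
        index += expected
    return report
```

An entry with "ok" set to False and fewer available than expected bytes would point at the image data being cut short before PIL ever sees it.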