Open udit-pandey-1 opened 5 months ago
Hi @udit-pandey-1 - could you provide a URL that we could use to reproduce? I'd also give our SaaS API a try. Our unstructured-python-client library will split the PDF up and distribute the pieces across multiple workers, which should give you faster processing times.
Hi @MthwRobinson, I got the above error in this file: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf Appreciate your efforts.
Hi @vegetableman, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.
$ pip install unstructured -U
$ pip install unstructured-inference -U
from unstructured.partition.auto import partition
elements = partition(
url="https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf",
include_page_breaks=True,
extract_image_block_types=["Image", "Table"],
extract_image_block_to_payload=True,
skip_infer_table_types=[],
)
print("\n\n".join([str(el) for el in elements]))
The latest versions worked for me :+1:... I was using the specific versions mentioned here: https://github.com/Unstructured-IO/unstructured/issues/2566#issuecomment-1982063333 Thank you, Christine!
However, partition_pdf does not support loading PDF files through a url parameter, unless I am mistaken. I had to use the filename parameter instead.
Yes, as of now, partition_pdf does not support loading PDF files through a url parameter. Do we plan to do this? @MthwRobinson
We don't plan to add that to partition_pdf as of now, though I believe it works in partition, which will detect the MIME type from the HTTP response.
@MthwRobinson that worked :+1:. My bad, I missed the auto module. Thank you!
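For anyone who needs to stay on partition_pdf, which only accepts filename or file, one workaround is to download the PDF first and pass the local path. The following is a minimal sketch, not part of the library: download_to_tempfile and partition_pdf_from_url are hypothetical helper names, and the sketch assumes the URL is directly fetchable without authentication.

```python
import os
import shutil
import tempfile
import urllib.request


def download_to_tempfile(url: str, suffix: str = ".pdf") -> str:
    """Stream the resource at `url` into a temporary file and return its path."""
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


def partition_pdf_from_url(url: str, **kwargs):
    """Hypothetical helper: download the PDF, then call partition_pdf(filename=...)."""
    # Imported lazily so the download helper can be used without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    path = download_to_tempfile(url)
    try:
        return partition_pdf(filename=path, **kwargs)
    finally:
        # Remove the temporary copy whether or not partitioning succeeded.
        os.remove(path)
```

Calling partition(url=...) from unstructured.partition.auto, as suggested above, remains the simpler route when it is an option.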
@christinestraub the issue is still occurring for me after upgrading the mentioned packages.
We are seeing this issue on Ubuntu 20.04.
Here is a reference PDF file for it: https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf
@udit-pandey-1, I tried to partition the reference PDF file on both macOS and Ubuntu (22.04). It worked as expected and I couldn't reproduce the error. Can you please try again?
Environment:
unstructured==0.14.6
unstructured-inference==0.7.35
Code:
from unstructured.partition.auto import partition
elements = partition(
url="https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf",
include_page_breaks=True,
extract_image_block_types=["Image", "Table"],
extract_image_block_to_payload=True,
skip_infer_table_types=[],
)
print("\n\n".join([str(el) for el in elements]))
still the same @christinestraub
unstructured==0.14.6
unstructured-inference==0.7.36
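Since the versions reported in this thread differ from comment to comment, it may help to print them from the same interpreter that runs the failing code. This is a small sketch using only the standard library; installed_versions is a hypothetical helper name, and the package list in the example is an assumption about which dependencies are relevant here.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages involved in PDF partitioning and image conversion.
    names = ["unstructured", "unstructured-inference", "pdf2image", "pillow"]
    for name, version in installed_versions(names).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```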
@udit-pandey-1 Can you confirm that you installed the following system dependencies?
libmagic-dev (filetype detection)
poppler-utils (images and PDFs)
libmagic-dev wasn't there. I installed it and then used the same code as above. It still failed with the same error.
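A quick way to check these system dependencies from Python is to look for the Poppler command-line tools on PATH and ask ctypes whether it can locate libmagic. This is only a sketch: missing_executables is a hypothetical helper, and the tool names are the Poppler binaries that pdf2image invokes (pdftoppm, pdftocairo, pdfinfo), which poppler-utils provides on Debian and Ubuntu.

```python
import ctypes.util
import shutil
from typing import Callable, Iterable, List, Optional


def missing_executables(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Poppler tools that pdf2image shells out to.
    print("Missing Poppler tools:", missing_executables(["pdftoppm", "pdftocairo", "pdfinfo"]))
    # find_library returns None when the shared library cannot be located.
    print("libmagic found:", ctypes.util.find_library("magic") is not None)
```

An empty list and "libmagic found: True" suggest the dependencies are visible to the interpreter. They do not rule out a version mismatch.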
Has there been any progress on this issue? I am facing the same problem, even after trying everything suggested above.
Hi there I'm having the same issue:
Python 3.10.12
unstructured 0.14.6
unstructured-client 0.25.6
unstructured-inference 0.7.35
unstructured.pytesseract 0.3.13
Unfortunately I can't share the documents as they contain proprietary information.
This is happening for every PDF in a folder of 50. All were generated from HTML files by opening them in Chrome and saving them as PDF.
Stacktrace:
---------------------------------------------------------------------------
UnidentifiedImageError Traceback (most recent call last)
[<ipython-input-21-a26b75af5795>](https://localhost:8080/#) in <cell line: 4>()
4 for k in data.keys():
5 fpath = f"/path/to/file/{k}"
----> 6 els = partition_pdf(filename=fpath,
7 max_partition=1500,
8 chunking_strategy='by_title',
10 frames
[/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
603 unique_element_ids: bool = call_args.get("unique_element_ids", False)
604 if unique_element_ids is False:
--> 605 elements = assign_and_map_hash_ids(elements)
606
607 return elements
/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)
[/usr/local/lib/python3.10/dist-packages/unstructured/chunking/dispatch.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
72
73 # -- call the partitioning function to get the elements --
---> 74 elements = func(*args, **kwargs)
75
76 # -- look for a chunking-strategy argument --
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
208 form_extraction_skip_tables=form_extraction_skip_tables,
209 **kwargs,
--> 210 )
211
212
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
344 if isinstance(file, bytes):
345 file = io.BytesIO(file)
--> 346 return _partition_pdf_with_pdfminer(
347 filename=filename,
348 file=file,
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in _partition_pdf_or_image_with_ocr(filename, file, include_page_breaks, languages, ocr_languages, is_image, metadata_last_modified, starting_page_number, **kwargs)
894 tmp_element = element
895 tmp_text = element.text
--> 896 tmp_coords = element.metadata.coordinates
897 elif tmp_element and check_coords_within_boundary(
898 coordinates=element.metadata.coordinates,
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf_image/pdf_image_utils.py](https://localhost:8080/#) in convert_pdf_to_images(filename, file, chunk_size)
414 date_from_file_object: bool = False,
415 ) -> str | None:
--> 416 last_modification_date = None
417 if not file and filename:
418 last_modification_date = get_last_modified_date(filename=filename)
[/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py](https://localhost:8080/#) in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
267 )
268 else:
--> 269 images += parse_buffer_func(data)
270 finally:
271 if auto_temp_dir:
[/usr/local/lib/python3.10/dist-packages/pdf2image/parsers.py](https://localhost:8080/#) in parse_buffer_to_ppm(data)
26 size_x, size_y = tuple(size.split(b" "))
27 file_size = len(code) + len(size) + len(rgb) + 3 + int(size_x) * int(size_y) * 3
---> 28 images.append(Image.open(BytesIO(data[index : index + file_size])))
29 index += file_size
30
[/usr/local/lib/python3.10/dist-packages/PIL/Image.py](https://localhost:8080/#) in open(fp, mode, formats)
3281 raise TypeError(msg) from e
3282 else:
-> 3283 rawmode = mode
3284 if mode in ["1", "L", "I", "P", "F"]:
3285 ndmax = 2
UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7e086492d030>
Describe the bug
I am getting the following error when extracting text and images from a PDF:
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpjy0tjjjd/2c2e244f-8f8e-46de-a7bc-2ecfbaa254ea-566.ppm'
To Reproduce
The way I am using unstructured is:
Expected behavior
Ideally, all the images in the PDF should be extracted. If there is a failure, image extraction should not abort for the whole document (in my case, the PDF has 800 pages and it fails after going through 600 pages). For the layouts where image extraction failed, a flag in the metadata could indicate that extraction failed and give the reason for it. We should be able to get the elements even when failures occur, through a flag that is passed when calling partition().
Environment Info
Any kind of quickfix to get elements even in case of failure would also be appreciated.
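Until something like this exists in the library, one possible quick fix is to isolate failures on the caller's side: split the document into smaller files, partition each one separately, and record any exception instead of letting it abort the whole run. The sketch below is an assumption-laden illustration, not an unstructured feature. partition_tolerantly and PartitionReport are hypothetical names, and splitting the PDF into per-page files (for example with a PDF library of your choice) is assumed to have happened beforehand.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Iterable


@dataclass
class PartitionReport:
    """Elements that were extracted, plus one record per file that failed."""
    elements: list[Any] = field(default_factory=list)
    failures: list[dict[str, str]] = field(default_factory=list)


def partition_tolerantly(
    page_files: Iterable[str],
    partition_fn: Callable[[str], list[Any]],
) -> PartitionReport:
    """Partition each file independently so one failure does not lose the rest."""
    report = PartitionReport()
    for path in page_files:
        try:
            report.elements.extend(partition_fn(path))
        except Exception as exc:  # deliberately broad: any failure is recorded, not raised
            report.failures.append(
                {"file": path, "error": type(exc).__name__, "reason": str(exc)}
            )
    return report
```

With unstructured installed, partition_fn could be something like lambda path: partition_pdf(filename=path). The failures list then plays the role of the "extraction failed, and why" flag described under Expected behavior.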