Open udit-pandey-1 opened 1 month ago
Hi @udit-pandey-1 - could you provide a URL that we could use to reproduce? I'd also give our SaaS API a try. Our unstructured-python-client library will split the PDF up and distribute it across multiple workers, which should give you faster processing times.
Hi @MthwRobinson, I got the above error in this file: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf Appreciate your efforts.
Hi @vegetableman, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.
$ pip install unstructured -U
$ pip install unstructured-inference -U
from unstructured.partition.auto import partition

elements = partition(
    url="https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)
print("\n\n".join([str(el) for el in elements]))
The latest versions worked for me :+1:... I was using the specific versions mentioned here: https://github.com/Unstructured-IO/unstructured/issues/2566#issuecomment-1982063333 Thank you, Christine!
However, partition_pdf does not support loading PDF files through a url parameter, unless I am mistaken. I had to use the filename parameter.
Yes, as of now, partition_pdf does not support loading PDF files through a url parameter. Do we plan to do this? @MthwRobinson
We don't plan to add that to partition_pdf as of now, though I believe it works in partition, which will detect the MIME type from the HTTP response.
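For cases where partition_pdf is needed specifically, one possible workaround is to download the file first and pass the local path as filename. The following is a minimal sketch, not part of the library: the helper names download_to_tempfile and partition_pdf_from_url are hypothetical, the download uses only the standard library, and it assumes the URL can be fetched directly without authentication.

```python
import os
import shutil
import tempfile
import urllib.request


def download_to_tempfile(url: str, suffix: str = ".pdf") -> str:
    """Download `url` to a temporary file and return the local path.

    Hypothetical helper; the caller is responsible for deleting the file.
    """
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


def partition_pdf_from_url(url: str, **kwargs):
    """Hypothetical wrapper: fetch the PDF, then partition the local copy."""
    # Imported lazily so the download helper works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    path = download_to_tempfile(url)
    try:
        return partition_pdf(filename=path, **kwargs)
    finally:
        os.remove(path)  # clean up the temporary download
```

When the file type is not known in advance, calling partition with url= as discussed above is probably the simpler route.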
@MthwRobinson that worked :+1:. My bad. Missed the auto module. Thank you!
@christinestraub the issue is still occurring for me after upgrading the mentioned packages.
We are seeing this issue on Ubuntu 20.04.
Here is a reference PDF file for it: https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf
@udit-pandey-1, I tried to partition the reference PDF file on both macOS and Ubuntu (22.04). It worked as expected and I couldn't reproduce the error. Can you please try again?
Environment:
unstructured==0.14.6
unstructured-inference==0.7.35
Code:
from unstructured.partition.auto import partition

elements = partition(
    url="https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)
print("\n\n".join([str(el) for el in elements]))
still the same @christinestraub
unstructured==0.14.6
unstructured-inference==0.7.36
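Since the versions differ between the reports in this thread, it may help to print the versions that the interpreter running the code actually sees, which can differ from what pip reports in another environment. A small sketch using only the standard library; the function name installed_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Sequence


def installed_versions(
    packages: Sequence[str] = ("unstructured", "unstructured-inference"),
) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    result: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```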
@udit-pandey-1 I was wondering if you are sure that you installed the following system dependencies?
libmagic-dev (filetype detection)
poppler-utils (images and PDFs)
libmagic-dev wasn't there. I installed it and then used the same code as above. It still failed with the same error.
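A rough way to check from Python whether these system dependencies are visible to the running process is sketched below. It is an assumption-laden diagnostic, not an official check: it looks for the pdftoppm executable (shipped with poppler-utils) on PATH and for the libmagic shared library, which is not the same as verifying that the -dev package itself is installed. The function name check_system_dependencies is hypothetical.

```python
import ctypes.util
import shutil
from typing import Dict


def check_system_dependencies() -> Dict[str, bool]:
    """Report whether pdftoppm and the libmagic shared library can be found."""
    return {
        # pdftoppm is one of the command-line tools in poppler-utils.
        "poppler-utils (pdftoppm)": shutil.which("pdftoppm") is not None,
        # find_library returns None when no libmagic shared library is found.
        "libmagic": ctypes.util.find_library("magic") is not None,
    }


if __name__ == "__main__":
    for name, found in check_system_dependencies().items():
        print(f"{name}: {'found' if found else 'NOT found'}")
```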
Describe the bug
I am getting the following error when extracting text and images from a PDF: ![image](https://github.com/Unstructured-IO/unstructured/assets/160473849/fdcd5219-f59e-43dd-911d-f40b5d9199f2)
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpjy0tjjjd/2c2e244f-8f8e-46de-a7bc-2ecfbaa254ea-566.ppm'
To Reproduce
The way I am using unstructured is: ![image](https://github.com/Unstructured-IO/unstructured-inference/assets/160473849/a943961e-0f59-4982-b1fd-e8e95822c5af)
Expected behavior
Ideally, all the images in the PDF should be extracted. If there is a failure, image extraction should not abort abruptly for the complete document (in my case, the PDF has 800 pages and it fails after going through 600 pages). For the layouts where image extraction failed, a flag in the metadata could convey that image extraction failed and give the reason for it. We should be able to get elements even in case of failures, through a flag that is passed when calling partition().
Environment Info
![image](https://github.com/Unstructured-IO/unstructured/assets/160473849/aa15b030-209d-4762-9cf6-fe21616509bb)
Any kind of quickfix to get elements even in case of failure would also be appreciated.
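One possible stopgap, sketched here under stated assumptions rather than as a library feature, is to split the PDF into smaller files beforehand (with any PDF tool; that step is not shown) and partition each piece independently, so that one failing piece does not discard the elements from the others. The function name partition_chunks and the chunk file names are hypothetical; partition_fn is whatever callable you use, for example partition from unstructured.partition.auto.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def partition_chunks(
    chunk_paths: Iterable[str],
    partition_fn: Callable[..., List[Any]],
    **kwargs: Any,
) -> Tuple[List[Any], List[Dict[str, str]]]:
    """Partition each chunk separately, collecting elements and failures.

    Returns (elements, failures); each failure records the chunk path and
    the exception that was raised for it.
    """
    elements: List[Any] = []
    failures: List[Dict[str, str]] = []
    for path in chunk_paths:
        try:
            elements.extend(partition_fn(filename=path, **kwargs))
        except Exception as exc:  # deliberately broad: keep going on any error
            failures.append(
                {"chunk": path, "error": f"{type(exc).__name__}: {exc}"}
            )
    return elements, failures
```

Usage would look like `elements, failures = partition_chunks(["part-001.pdf", "part-002.pdf"], partition, extract_image_block_types=["Image", "Table"])`. Note that page numbers in the element metadata would likely be relative to each chunk rather than to the original document.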