Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

UnidentifiedImageError: cannot identify image file '/tmp/tmpa3o9dj66/b5d7995b-82db-4257-bdcb-20795a00c72b-01.ppm' #3474

Open SaleemMalik632 opened 2 months ago

SaleemMalik632 commented 2 months ago

I have a clear PDF with proper images, but this code raises the error in the title:

from unstructured.partition.pdf import partition_pdf
from PIL import UnidentifiedImageError

# Extract images, tables, and chunk text
raw_pdf_elements = partition_pdf(
    filename='/content/2023-conocophillips-aim-presentation.pdf',
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path='/content/',
)

I did a lot of R&D but could not find any solution. unstructured is not a good PDF parser.

ShkAmmarHussain commented 2 months ago

If you are running on Colab or Jupyter, restart the session and then try again.
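One possible reason a restart helps (an assumption, not something confirmed in this thread) is that pip upgraded Pillow after PIL had already been imported into the running kernel, so the code in memory no longer matches the files on disk. Below is a minimal sketch for checking that before restarting. The helper name `loaded_version_mismatch` is hypothetical; it only uses the standard library and compares a module's `__version__` with the installed package metadata.

```python
import sys
from importlib import metadata


def loaded_version_mismatch(dist_name, module_name):
    """Return (loaded, installed) if the imported module is stale, else None.

    Hypothetical helper: dist_name is the pip package name (e.g. "Pillow"),
    module_name is the import name (e.g. "PIL").
    """
    module = sys.modules.get(module_name)
    if module is None:
        # Not imported in this session yet, so nothing can be stale.
        return None
    loaded = getattr(module, "__version__", None)
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Package is not installed at all; nothing to compare against.
        return None
    if loaded is not None and loaded != installed:
        return (loaded, installed)
    return None


# A non-None result suggests the session should be restarted.
print(loaded_version_mismatch("Pillow", "PIL"))
```

If this prints a pair of version strings, the kernel is still using the older Pillow and a restart should pick up the installed one. If it prints None, the cause is likely something else.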