Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
UnidentifiedImageError: cannot identify image file '/tmp/tmpa3o9dj66/b5d7995b-82db-4257-bdcb-20795a00c72b-01.ppm' #3474
Open
SaleemMalik632 opened 2 months ago
I have a clean PDF with proper images, but the following code raises the error in the title:
from unstructured.partition.pdf import partition_pdf
from PIL import UnidentifiedImageError

raw_pdf_elements = partition_pdf(
    filename='/content/2023-conocophillips-aim-presentation.pdf',
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path='/content/',
)
I did a lot of R&D but could not find any solution. In my experience, unstructured is not a good PDF parser.
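One way to narrow this down might be to check whether the intermediate `.ppm` file named in the error is missing, empty, or has an unexpected header. The sketch below uses only the Python standard library; `describe_ppm` is a hypothetical helper, not part of unstructured or Pillow, and it assumes a copy of the temporary file is still available (the temporary directory may already have been cleaned up by the time the error is reported).

```python
from pathlib import Path

# Magic numbers of the Netpbm family; PPM itself is P3 (ASCII) or P6 (binary).
NETPBM_MAGIC = {b"P1", b"P2", b"P3", b"P4", b"P5", b"P6"}


def describe_ppm(path):
    """Return a short diagnosis for a supposed .ppm file (hypothetical helper)."""
    p = Path(path)
    if not p.exists():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    # Only the first two bytes are needed to recognise a Netpbm header.
    with p.open("rb") as fh:
        magic = fh.read(2)
    return "looks like netpbm" if magic in NETPBM_MAGIC else "unrecognized header"


if __name__ == "__main__":
    # Path copied from the error message; it will likely report "missing"
    # unless the temporary file was preserved.
    print(describe_ppm("/tmp/tmpa3o9dj66/b5d7995b-82db-4257-bdcb-20795a00c72b-01.ppm"))
```

A result of "empty" or "unrecognized header" would point at the step that writes the page image rather than at the PDF itself, while "missing" only says the file is no longer there to inspect.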