Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.54k stars, 595 forks

bug/<pdfminer> #3121

Closed Tilemachoc closed 1 month ago

Tilemachoc commented 1 month ago

TEST CODE:

import langchain
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_from_json

filename = "file.pdf"

elements = partition_pdf(
    filename=filename,
    strategy="hi_res",
    infer_table_structure=True,
    model_name="yolox",
)

print(elements)
for elem in elements:
    print("------")
    print(elem.metadata.text_as_html)


ERROR:

line 5, in <module>
    import unstructured.partition.pdf
ModuleNotFoundError: No module named 'pdfminer'

scanny commented 1 month ago

@Tilemachoc You'll need to install PDF extras:

pip install unstructured[pdf]

https://docs.unstructured.io/open-source/installation/full-installation
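
If it helps, here is a minimal sketch for checking whether the module from the traceback is importable before calling partition_pdf. The helper name `missing_modules` is hypothetical (it is not part of unstructured), and the sketch only checks `pdfminer`, the one module named in the error; the PDF extras may install other packages that it does not try to list.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# "pdfminer" is taken from the ModuleNotFoundError in this issue.
# Assumption: checking this one module is enough to catch this specific error.
missing = missing_modules(["pdfminer"])
if missing:
    print(f"Missing modules: {missing}. Try: pip install 'unstructured[pdf]'")
else:
    print("pdfminer is importable, so this particular import error should not occur.")
```

Run it with the same interpreter (and virtual environment) that runs the test code, since a package installed into a different environment will still show up as missing. In shells such as zsh, the square brackets in the pip command usually need quoting, as in the message printed above.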

Closing as assumed resolved, but feel free to reopen if you're still having trouble :)