Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/pdf extraction error when strategy not set #3187

Closed pk-lit closed 1 day ago

pk-lit commented 3 weeks ago

def extract_text_by_page(pdf_path):
    """Extracts text from each page of a PDF file using unstructured.io."""
    document = partition_pdf(pdf_path)
    pages_text = [page.page_content for page in document.pages]
    return pages_text

results in this error:

Traceback (most recent call last):
  File "/Users/pk/skynet/viso-pdf/unst.py", line 60, in <module>
    main(args.pdf, args.output, args.chunks)
  File "/Users/pk/skynet/viso-pdf/unst.py", line 35, in main
    extracted_pages = extract_text_by_page(pdf_path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/viso-pdf/unst.py", line 13, in extract_text_by_page
    document = partition_pdf(pdf_path)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/documents/elements.py", line 591, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 618, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 192, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 321, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 776, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 802, in _partition_pdf_or_image_with_ocr_from_image
    ocr_agent = OCRAgent.get_agent()
                ^^^^^^^^^^^^^^^^^^^^
  File "/Users/pk/skynet/.venv/lib/python3.11/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 37, in get_agent
    raise ValueError(
ValueError: Environment variable OCR_AGENT must be set to an existing OCR agent module, not unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract.

This is solved by passing a strategy as a parameter to partition_pdf. It would be good to add an error message to that effect, rather than sending users down a weird env var rabbit hole.

cheers!
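For anyone landing here with the same traceback, a minimal sketch of the workaround described above: pass `strategy` explicitly. The choice of `"fast"` is an assumption (the report does not say which strategy was used); it reads embedded text without OCR, so scanned PDFs would likely need `"hi_res"` or `"ocr_only"` instead. The sketch also assumes `partition_pdf` returns a flat list of elements exposing `.text` and `.metadata.page_number`, and groups them by page itself. `group_text_by_page` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def group_text_by_page(elements):
    """Group element text by page number (hypothetical helper).

    Assumes each element exposes `.text` and `.metadata.page_number`.
    """
    pages = defaultdict(list)
    for element in elements:
        page_number = getattr(element.metadata, "page_number", None)
        text = (element.text or "").strip()
        if text:  # skip empty or whitespace-only elements
            pages[page_number].append(text)
    # One string per page, ascending; elements without a page number go last.
    ordered = sorted(pages, key=lambda n: (n is None, n or 0))
    return ["\n".join(pages[n]) for n in ordered]


def extract_text_by_page(pdf_path, strategy="fast"):
    """Partition a PDF with an explicit strategy; return one string per page."""
    # Imported lazily so the helper above works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    # Assumption: "fast" is enough for PDFs that contain embedded text.
    elements = partition_pdf(filename=pdf_path, strategy=strategy)
    return group_text_by_page(elements)
```

Calling `extract_text_by_page("some.pdf")` (placeholder path) would then return a list with one entry per page that produced text.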

MthwRobinson commented 3 weeks ago

Thanks for the report, @pk-lit!

christinestraub commented 3 weeks ago

Hi @pk-lit, are you using the latest versions of the unstructured (0.14.5) and unstructured-inference (0.7.34) libraries? I did not get this error with those versions.

$ pip install unstructured -U
$ pip install unstructured-inference -U
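To confirm which versions are actually installed in the active environment after upgrading, something like the following could be run. It is a small sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package names are the two mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("unstructured", "unstructured-inference")):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this from the same virtualenv that produced the traceback makes it easy to compare against 0.14.5 and 0.7.34.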