Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.37k stars, 572 forks

bug/<ocr-agent> call PartitionPdf error: no ocr_agent found #3202

Open ZephryLiang opened 2 weeks ago

ZephryLiang commented 2 weeks ago

Describe the bug

My code:

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
elements = partition_pdf(file=f, ocr_agent=ocr_agent, strategy='ocr_only')

Error:

Environment variable OCR_AGENT must be set to an existing OCR agent module, not unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle.

Expected behavior

I want to extract elements from a PDF. How can I do this?

ZephryLiang commented 2 weeks ago

The error message comes from this part of the source code:

def get_agent(cls) -> OCRAgent:
    """Get the configured OCRAgent instance.

    The OCR package used by the agent is determined by the `OCR_AGENT` environment variable.
    """
    ocr_agent_cls_qname = cls._get_ocr_agent_cls_qname()
    try:
        return cls.get_instance(ocr_agent_cls_qname)
    except (ImportError, AttributeError):
        raise ValueError(
            f"Environment variable OCR_AGENT must be set to an existing OCR agent module,"
            f" not {ocr_agent_cls_qname}."
        )

Which agent can I use? Please help!
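
Note that the `except (ImportError, AttributeError)` clause above turns any import failure into the same `ValueError`, so the message does not say whether the module path is wrong or a dependency is missing. A minimal sketch for surfacing the underlying error, assuming the qualified name has the form `package.module.ClassName`; the helper `diagnose_ocr_agent` is hypothetical and not part of unstructured:

```python
import importlib


def diagnose_ocr_agent(qname: str) -> str:
    """Try to import the class named by `qname` and report what fails.

    Hypothetical helper, not part of unstructured. Assumes `qname` looks
    like "package.module.ClassName".
    """
    module_name, _, class_name = qname.rpartition(".")
    if not module_name:
        return f"'{qname}' is not a dotted 'module.ClassName' path"
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # The module itself, or something it imports, could not be loaded.
        return f"ImportError: {exc}"
    if not hasattr(module, class_name):
        return f"AttributeError: {module_name} has no attribute {class_name}"
    return "ok"


# Example call with the value from this issue:
# print(diagnose_ocr_agent(
#     "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
# ))
```

If the result starts with `ImportError`, the text after it names what failed to import, which is usually more useful than the generic message.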

MthwRobinson commented 2 weeks ago

Closing in favor of #3187. Looks like the same issue.

christinestraub commented 2 weeks ago

Hi @LiangZeFenglzf, you need to install additional dependencies to use PaddleOCR. You can use the following shell script to install those dependencies:

#!/usr/bin/env bash

# aarch64 requires a custom build of paddlepaddle
if [ "${ARCH}" = "aarch64" ]; then
  python3 -m pip install unstructured.paddlepaddle
else
  python3 -m pip install paddlepaddle
fi
python3 -m pip install unstructured.paddleocr

Also, you don't need to pass the ocr_agent param, so

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
elements = partition_pdf(file=f, strategy='ocr_only')

ZephryLiang commented 1 week ago

It doesn't work. pip list shows: paddlepaddle 2.6.1

christinestraub commented 1 week ago

@ZephryLiang, please mention how you installed unstructured, which versions of the libraries (unstructured, unstructured-inference) you have, and which OS you're on (Linux, macOS).
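
The requested details can be collected in one step. A minimal sketch using only the standard library; the function name `collect_environment_report` is hypothetical, and the package list is an assumption based on the names mentioned in this thread:

```python
import platform
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from this thread; adjust to your installation.
PACKAGES = [
    "unstructured",
    "unstructured-inference",
    "paddlepaddle",
    "unstructured.paddleocr",
]


def collect_environment_report(packages=PACKAGES) -> dict:
    """Return OS, architecture, Python version, and installed package versions."""
    report = {
        "os": platform.system(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            report[name] = "not installed"
    return report


for key, value in collect_environment_report().items():
    print(f"{key}: {value}")
```

The `machine` field matters here because the install script above takes a different branch on aarch64.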

liuxu4567 commented 2 days ago

I'm having the same issue with the latest container version. The error stack is "ImportError: /home/notebook-user/.local/lib/python3.11/site-packages/paddle/fluid/libpaddle.so: cannot open shared object file: No such file or directory", but the libpaddle.so file exists.
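
One way to narrow down an error like this is to load the shared library directly with `ctypes`, outside of paddle's own import code, and read the dynamic loader's message. This is a diagnostic sketch only, not a fix; the helper `try_load_shared_lib` is hypothetical, and the path to pass is the one from your own error message:

```python
import ctypes
import os
from typing import Tuple


def try_load_shared_lib(path: str) -> Tuple[bool, str]:
    """Attempt to load a shared library and return (success, message)."""
    if not os.path.exists(path):
        # The file is missing from the point of view of this process.
        return False, f"file not found: {path}"
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        # The file exists but the dynamic loader refused to load it.
        return False, f"loader error: {exc}"
    return True, "loaded"


# Example call, using the path reported in the ImportError:
# print(try_load_shared_lib("/path/from/your/error/libpaddle.so"))
```

A "file not found" result from inside the container would suggest the path is not visible to the running process, while a "loader error" points at the library or its dependencies.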