Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/unstructured.paddleocr is not compatible with GPU version of PaddleOCR #3191

Open peixin-lin opened 2 weeks ago

peixin-lin commented 2 weeks ago

I got the following error when setting the OCR agent to Paddle and loading a GPU model:

    |   File "/usr/local/lib/python3.9/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 35, in get_agent
    |     return cls.get_instance(ocr_agent_cls_qname)
    |   File "/usr/local/lib/python3.9/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
    |   File "/usr/local/lib/python3.9/site-packages/unstructured_paddleocr/paddle_tools/infer/utility.py", line 314, in get_infer_gpuid
    |     if not paddle.fluid.core.is_compiled_with_rocm():
    | AttributeError: module 'paddle' has no attribute 'fluid'
    |
    | During handling of the above exception, another exception occurred:

which finally leads to the following error:

    |     page_elements = _partition_pdf_or_image_with_ocr_from_image(
    |   File "/usr/local/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 802, in _partition_pdf_or_image_with_ocr_from_image
    |     ocr_agent = OCRAgent.get_agent()
    | ValueError: Environment variable OCR_AGENT must be set to an existing OCR agent module, not unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle.
    +------------------------------------
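
Because the ValueError is raised while the AttributeError is being handled, the message about OCR_AGENT can hide the real cause. One way to see the underlying failure directly is to load the agent class by hand, outside of unstructured. The sketch below is only an illustration: `load_class` is a hypothetical helper, not part of unstructured, and it relies only on the standard library.

```python
import importlib


def load_class(qualified_name):
    """Import 'package.module.ClassName' and return the class object.

    Hypothetical helper for debugging. Any exception raised while the
    module is imported propagates unchanged, so the original error is
    visible instead of a wrapper error.
    """
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {qualified_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


if __name__ == "__main__":
    # Demonstration with a standard-library class.
    print(load_class("collections.OrderedDict"))
```

To apply this to the case here, pass the qualified name from the error message to `load_class`. If the failure only happens when the agent is constructed rather than when its module is imported, calling the returned class may be needed to reproduce it; whether the class can be called without arguments is an assumption to check against the installed version.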

The installed paddlepaddle no longer has a paddle.fluid module, so I think the problem could be solved by changing line 314 of unstructured_paddleocr/paddle_tools/infer/utility.py from if not paddle.fluid.core.is_compiled_with_rocm(): to if not paddle.core.is_compiled_with_rocm():.
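
A version-tolerant variant of that check could try several attribute paths instead of hard-coding one. The following is a minimal sketch, not the library's code: `resolve_attr` and `is_compiled_with_rocm` are hypothetical helpers, and apart from the first path (taken from the traceback) the candidate paths are assumptions that should be verified against the installed Paddle version.

```python
from types import SimpleNamespace

# Attribute paths tried in order, relative to the paddle module.
CANDIDATE_PATHS = (
    "fluid.core.is_compiled_with_rocm",  # path used by the failing line
    "core.is_compiled_with_rocm",        # path proposed in this issue (unverified)
    "device.is_compiled_with_rocm",      # assumed public location (unverified)
)


def resolve_attr(root, dotted_path):
    """Follow a dotted attribute path from root; return None if any part is missing."""
    obj = root
    for name in dotted_path.split("."):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj


def is_compiled_with_rocm(paddle_module, default=False):
    """Call the first candidate function that exists; fall back to default."""
    for path in CANDIDATE_PATHS:
        fn = resolve_attr(paddle_module, path)
        if callable(fn):
            return bool(fn())
    return default


if __name__ == "__main__":
    # Stand-in object that mimics a Paddle build without paddle.fluid.
    fake_paddle = SimpleNamespace(
        device=SimpleNamespace(is_compiled_with_rocm=lambda: False)
    )
    print(is_compiled_with_rocm(fake_paddle))
```

With a real installation, the call would be `is_compiled_with_rocm(paddle)` after `import paddle`. The fallback default keeps the check from raising AttributeError when none of the paths exist, at the cost of silently assuming a non-ROCm build.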

My dependencies:

unstructured             0.14.5
unstructured-client      0.23.3
unstructured-inference   0.7.34
unstructured.paddleocr   2.6.1.3
unstructured.pytesseract 0.3.12
paddleclas               2.5.2
paddleocr                2.7.3
paddlepaddle             2.6.1
paddlepaddle-gpu         2.6.1.post112

MthwRobinson commented 2 weeks ago

Hi @peixin-lin - thanks for the report. We don't plan to support GPUs in the open source library, but if we add GPU support to our SaaS products, we may address this in unstructured.paddleocr then.