Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/OCRAgentGoogleVision takes 1 positional argument but 2 were given #3659

Open pprados opened 1 month ago

pprados commented 1 month ago

Describe the bug: Parsing a PDF with OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision fails with the error in the title.

To Reproduce:

import os

os.environ[
    "OCR_AGENT"] = "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision"

from unstructured.partition.pdf import partition_pdf

partition_pdf("fake-memo.pdf",
              strategy="hi_res",
              )

Expected behavior: No error.

Environment Info:
OS version: Linux-6.8.0-45-generic-x86_64-with-glibc2.39
Python version: 3.11.4
unstructured version: 0.15.14.dev1
unstructured-inference version: 0.7.36
pytesseract is not installed
Torch version: 2.4.1
Detectron2 is not installed
PaddleOCR version: None
Libmagic version: file-5.45
magic file from /etc/magic:/usr/share/misc/magic


DavidBlore commented 1 month ago

I’m experiencing the same issue. It arises in OCRAgent's get_instance method, which passes language to the constructor of whichever agent class it loads and therefore expects every OCRAgent subclass to accept that parameter. See below for clarity:

@staticmethod
@functools.lru_cache(maxsize=None)
def get_instance(ocr_agent_module: str, language: str) -> "OCRAgent":
    module_name, class_name = ocr_agent_module.rsplit(".", 1)
    if module_name not in OCR_AGENT_MODULES_WHITELIST:
        raise ValueError(
            f"Environment variable OCR_AGENT module name {module_name} must be set to a "
            f"whitelisted module part of {OCR_AGENT_MODULES_WHITELIST}."
        )

    try:
        module = importlib.import_module(module_name)
        loaded_class = getattr(module, class_name)
        return loaded_class(language) # <--- This is where the issue occurs
    except (ImportError, AttributeError) as e:
        logger.error(f"Failed to get OCRAgent instance: {e}")
        raise RuntimeError(
            "Could not get the OCRAgent instance. Please check the OCR package and the "
            "OCR_AGENT environment variable."
        )

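However, the OCRAgentGoogleVision constructor does not take a language parameter, so the call loaded_class(language) raises a TypeError: OCRAgentGoogleVision takes 1 positional argument but 2 were given. Because the except clause only handles ImportError and AttributeError, the TypeError is not converted into the RuntimeError and propagates to the caller.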
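The failure mode can be reproduced without unstructured or Google Cloud installed. The sketch below is only an illustration: AgentWithoutLanguage and instantiate are hypothetical stand-ins that mimic the call shape of loaded_class(language) and its except clause, not the library's code.

```python
class AgentWithoutLanguage:
    """Hypothetical stand-in for an agent whose constructor takes no language."""

    def __init__(self):
        pass


def instantiate(loaded_class, language):
    """Mimic the call shape in get_instance: language is always passed positionally."""
    try:
        return loaded_class(language)
    except (ImportError, AttributeError) as e:
        # A TypeError is not listed here, so it is not wrapped and reaches the caller.
        raise RuntimeError("Could not get the OCRAgent instance.") from e


try:
    instantiate(AgentWithoutLanguage, "eng")
    error = None
except TypeError as e:
    error = e

print(type(error).__name__, error)
```

The printed message names the stand-in class rather than OCRAgentGoogleVision, but the cause is the same: one extra positional argument reaches a constructor that only accepts self.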

I'm willing to submit a PR to address this but want to know what the desired approach to solving this would be. Some possible options are:

christinestraub commented 1 month ago

Hi @DavidBlore,

Thank you for your willingness to submit a PR to address this issue. After considering the options you've presented, I believe the most suitable approach would be:

Modify OCRAgentGoogleVision: Add a language parameter to its constructor.

This approach offers several advantages:

Implementation suggestion:

class OCRAgentGoogleVision(OCRAgent):
    def __init__(self, language='en'):
        super().__init__()
        self.language = language
        # ... rest of the constructor

This change allows users to specify a language if needed, but defaults to English ('en') if not provided, similar to other OCR agents.
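As a rough check of the suggestion, the sketch below shows that a constructor with a defaulted language parameter accepts both the positional call made by get_instance and a call with no arguments. BaseAgent and PatchedAgent are hypothetical stand-ins for OCRAgent and OCRAgentGoogleVision; whether the real agent forwards the stored language to Google Vision is not covered in this thread and is not assumed here.

```python
from abc import ABC


class BaseAgent(ABC):
    """Hypothetical stand-in for the OCRAgent base class."""


class PatchedAgent(BaseAgent):
    """Hypothetical stand-in for OCRAgentGoogleVision with the suggested signature."""

    def __init__(self, language="en"):
        super().__init__()
        self.language = language


explicit = PatchedAgent("fr")  # same call shape as loaded_class(language)
default = PatchedAgent()       # call sites that pass no language still work

print(explicit.language, default.language)
```

Keeping the default value means code that constructs the agent directly, without a language, is not broken by the new parameter.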

Next steps: