Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.37k stars 572 forks source link

Minor ocr_interface.py Error Handling Improvement #3276

Open AscendingGrass opened 1 week ago

AscendingGrass commented 1 week ago

In the ocr_interface.py file, it would be nice if the code handles the importlib.import_module(module_name) in the get_instance(...) function

@staticmethod
@functools.lru_cache(maxsize=None)
def get_instance(ocr_agent_module: str) -> "OCRAgent":
    module_name, class_name = ocr_agent_module.rsplit(".", 1)
    if module_name in OCR_AGENT_MODULES_WHITELIST:
        module = importlib.import_module(module_name)
        loaded_class = getattr(module, class_name)
        return loaded_class()
    else:
        raise ValueError(
            f"Environment variable OCR_AGENT module name {module_name}, must be set to a"
            f" whitelisted module part of {OCR_AGENT_MODULES_WHITELIST}.",
        )

I was so confused when I keep getting this error from the get_agent(...) function

ValueError: Environment variable OCR_AGENT must be set to an existing OCR agent module, not unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract.

when after hours of digging it turns out I just haven't installed pandas lol🗿

scanny commented 3 days ago

@AscendingGrass the code producing the error message you were getting is actually here: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/ocr_interface.py#L38

class OCRAgent(ABC):
    """Defines the interface for an Optical Character Recognition (OCR) service."""

    @classmethod
    def get_agent(cls) -> OCRAgent:
        """Get the configured OCRAgent instance.

        The OCR package used by the agent is determined by the `OCR_AGENT` environment variable.
        """
        ocr_agent_cls_qname = cls._get_ocr_agent_cls_qname()
        try:
            return cls.get_instance(ocr_agent_cls_qname)
        except (ImportError, AttributeError):
            raise ValueError(
                f"Environment variable OCR_AGENT must be set to an existing OCR agent module,"
                f" not {ocr_agent_cls_qname}."
            )

The wording starts out very similarly :)

I'm thinking the right solution for this case is to print the original error messages as well as the "Environment variable ..." message.

I'm guessing the triggering error in your case was something like ImportError: cannot import package 'pandas' or whatever, and something like that might have been just the pointer you would need in that situation :)

What do you think about that idea?

Implementation would be something like:

        try:
            return cls.get_instance(ocr_agent_cls_qname)
        except (ImportError, AttributeError) as e:
            raise ValueError(
                f"Environment variable OCR_AGENT must be set to an existing OCR agent module,"
                f" not {ocr_agent_cls_qname}: {str(e)}"
            )