Open AscendingGrass opened 5 months ago
@AscendingGrass the code producing the error message you were getting is actually here: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/ocr_interface.py#L38
class OCRAgent(ABC):
"""Defines the interface for an Optical Character Recognition (OCR) service."""
@classmethod
def get_agent(cls) -> OCRAgent:
"""Get the configured OCRAgent instance.
The OCR package used by the agent is determined by the `OCR_AGENT` environment variable.
"""
ocr_agent_cls_qname = cls._get_ocr_agent_cls_qname()
try:
return cls.get_instance(ocr_agent_cls_qname)
except (ImportError, AttributeError):
raise ValueError(
f"Environment variable OCR_AGENT must be set to an existing OCR agent module,"
f" not {ocr_agent_cls_qname}."
)
The wording starts out very similarly :)
I'm thinking the right solution for this case is to print the original error messages as well as the "Environment variable ..."
message.
I'm guessing the triggering error in your case was something like ImportError: cannot import package 'pandas'
or whatever, and something like that might have been just the pointer you would need in that situation :)
What do you think about that idea?
Implementation would be something like:
try:
return cls.get_instance(ocr_agent_cls_qname)
except (ImportError, AttributeError) as e:
raise ValueError(
f"Environment variable OCR_AGENT must be set to an existing OCR agent module,"
f" not {ocr_agent_cls_qname}: {str(e)}"
)
sounds good enough, no complaints here👍👍
In the
ocr_interface.py
file, it would be nice if the code handles theimportlib.import_module(module_name)
in theget_instance(...)
functionI was so confused when I keep getting this error from the
get_agent(...)
functionValueError: Environment variable OCR_AGENT must be set to an existing OCR agent module, not unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract.
when after hours of digging it turns out I just haven't installed pandas lol🗿