Open pprados opened 1 month ago
I’m experiencing the same issue. The issue arises in OCRAgent's get_instance method, as it expects all OCRAgents to have a language parameter in their constructor. See below for clarity:
@staticmethod
@functools.lru_cache(maxsize=None)
def get_instance(ocr_agent_module: str, language: str) -> "OCRAgent":
    module_name, class_name = ocr_agent_module.rsplit(".", 1)
    if module_name not in OCR_AGENT_MODULES_WHITELIST:
        raise ValueError(
            f"Environment variable OCR_AGENT module name {module_name} must be set to a "
            f"whitelisted module part of {OCR_AGENT_MODULES_WHITELIST}."
        )
    try:
        module = importlib.import_module(module_name)
        loaded_class = getattr(module, class_name)
        return loaded_class(language)  # <--- This is where the issue occurs
    except (ImportError, AttributeError) as e:
        logger.error(f"Failed to get OCRAgent instance: {e}")
        raise RuntimeError(
            "Could not get the OCRAgent instance. Please check the OCR package and the "
            "OCR_AGENT environment variable."
        )
However, the OCRAgentGoogleVision constructor does not take a language parameter, so the call raises a TypeError: OCRAgentGoogleVision takes 1 positional argument but 2 were given. Because get_instance catches only ImportError and AttributeError, this exception propagates to the caller unchanged.
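The failure can be reproduced without the library. The sketch below uses hypothetical stand-in classes (they are not the real unstructured classes) and a simplified get_instance that keeps only the constructor call and the except clause from the excerpt above:

```python
class OCRAgent:
    """Hypothetical stand-in for the library's OCRAgent base class."""


class OCRAgentGoogleVision(OCRAgent):
    """Stand-in whose constructor, like the one described above, takes no language."""

    def __init__(self):
        self.client = None  # placeholder for whatever the real constructor sets up


def get_instance(loaded_class, language: str) -> OCRAgent:
    """Simplified factory: same constructor call and same except clause as the excerpt."""
    try:
        return loaded_class(language)
    except (ImportError, AttributeError) as e:
        raise RuntimeError("Could not get the OCRAgent instance.") from e


def reproduce() -> str:
    """Return the name of the exception type that escapes get_instance."""
    try:
        get_instance(OCRAgentGoogleVision, "eng")
    except Exception as e:
        return type(e).__name__
    return "no error"


print(reproduce())  # TypeError
```

The TypeError is not converted into the RuntimeError with the friendlier message, because it matches neither ImportError nor AttributeError.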
I'm willing to submit a PR to address this but want to know what the desired approach would be. Some possible options are:
1. Require all OCRAgent classes to have a language parameter by defining it in the OCRAgent abstract base class (ABC).
2. Modify OCRAgentGoogleVision: Add a language parameter to its constructor.
3. Modify ocr_interface.py to check whether language is a parameter in the constructor. If it is, call loaded_class(language); otherwise, call loaded_class().

Hi @DavidBlore,

Thank you for your willingness to submit a PR to address this issue. After considering the options you've presented, I believe the most suitable approach would be:

Modify OCRAgentGoogleVision: Add a language parameter to its constructor.

This approach offers several advantages:
- It aligns OCRAgentGoogleVision with other OCR agents such as OCRAgentTesseract and OCRAgentPaddle, which already have language parameters.

Implementation suggestion:
class OCRAgentGoogleVision(OCRAgent):
    def __init__(self, language='en'):
        super().__init__()
        self.language = language
        # ... rest of the constructor
This change allows users to specify a language if needed, but defaults to English ('en') if not provided, similar to other OCR agents.
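As a rough check of how the suggested constructor would behave with a cached factory, here is a minimal sketch. The classes are hypothetical stand-ins, and the REGISTRY dictionary is an assumption used in place of the importlib lookup so that the example stays self-contained:

```python
import functools


class OCRAgent:
    """Hypothetical stand-in for the library's OCRAgent base class."""


class OCRAgentGoogleVision(OCRAgent):
    """Stand-in with the suggested constructor: language is optional."""

    def __init__(self, language: str = "en"):
        super().__init__()
        self.language = language


# Assumption: a plain dict replaces importlib.import_module + getattr.
REGISTRY = {"OCRAgentGoogleVision": OCRAgentGoogleVision}


@functools.lru_cache(maxsize=None)
def get_instance(class_name: str, language: str) -> OCRAgent:
    """Build (and cache) an agent, always passing language as the excerpt does."""
    loaded_class = REGISTRY[class_name]
    return loaded_class(language)


agent = get_instance("OCRAgentGoogleVision", "fr")
print(agent.language)                   # fr
print(OCRAgentGoogleVision().language)  # en
```

With this signature the factory call loaded_class(language) succeeds, and code that constructs the agent with no arguments keeps working through the default.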
Next steps:

Describe the bug
Try to parse a PDF with OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision.

To Reproduce
Provide a code snippet that reproduces the issue.

Expected behavior
No error

Environment Info
OS version: Linux-6.8.0-45-generic-x86_64-with-glibc2.39
Python version: 3.11.4
unstructured version: 0.15.14.dev1
unstructured-inference version: 0.7.36
pytesseract is not installed
Torch version: 2.4.1
Detectron2 is not installed
PaddleOCR version: None
Libmagic version: file-5.45
magic file from /etc/magic:/usr/share/misc/magic

Additional context
Add any other context about the problem here.