Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.33k stars, 567 forks

bug/language specification does not work for PaddleOCR agent #3159

Open peixin-lin opened 2 weeks ago

peixin-lin commented 2 weeks ago

I specified the languages parameter as ["chi", "eng"], but it had no effect. When I upload a Chinese PDF document, Unstructured still loads an English model. I checked the source code and found these lines in unstructured\partition\utils\ocr_models\paddle_ocr.py, where __init__ takes no argument for specifying the language and calls load_agent() without one, so the default language is always used:

class OCRAgentPaddle(OCRAgent):
    """OCR service implementation for PaddleOCR."""

    def __init__(self):
        self.agent = self.load_agent()

    def load_agent(self, language: str = DEFAULT_PADDLE_LANG):
        """Loads the PaddleOCR agent as a global variable to ensure that we only load it once."""

        import paddle
        from unstructured_paddleocr import PaddleOCR

        # Disable signal handlers at C++ level upon failing
        # ref: https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/
        #      disable_signal_handler_en.html#disable-signal-handler
        paddle.disable_signal_handler()
        # Use paddlepaddle-gpu if there is gpu device available
        gpu_available = paddle.device.cuda.device_count() > 0
        if gpu_available:
            logger.info(f"Loading paddle with GPU on language={language}...")
        else:
            logger.info(f"Loading paddle with CPU on language={language}...")
        try:
            # Enable MKL-DNN for paddle to speed up OCR if OS supports it
            # ref: https://paddle-inference.readthedocs.io/en/master/
            #      api_reference/cxx_api_doc/Config/CPUConfig.html
            paddle_ocr = PaddleOCR(
                use_angle_cls=True,
                use_gpu=gpu_available,
                lang=language,
                enable_mkldnn=True,
                show_log=False,
            )
        except AttributeError:
            paddle_ocr = PaddleOCR(
                use_angle_cls=True,
                use_gpu=gpu_available,
                lang=language,
                enable_mkldnn=False,
                show_log=False,
            )
        return paddle_ocr

Is there a way to work around this?
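
For illustration, here is a minimal sketch of how the constructor could forward a language to load_agent. This is not the library's actual fix: the class name OCRAgentPaddleSketch is hypothetical, the default value "en" is an assumption, and the PaddleOCR construction is replaced by a plain dict so the example runs without paddle installed.

```python
# Assumed default; the real value comes from the library's constants.
DEFAULT_PADDLE_LANG = "en"


class OCRAgentPaddleSketch:
    """Hypothetical stand-in for OCRAgentPaddle that accepts a language."""

    def __init__(self, language: str = DEFAULT_PADDLE_LANG):
        # Forward the caller's language instead of calling load_agent()
        # with no argument, which would always fall back to the default.
        self.language = language
        self.agent = self.load_agent(language)

    def load_agent(self, language: str = DEFAULT_PADDLE_LANG):
        # Placeholder for the PaddleOCR(...) call shown above; a dict
        # stands in for the loaded model in this sketch.
        return {"lang": language}


default_agent = OCRAgentPaddleSketch()
chinese_agent = OCRAgentPaddleSketch(language="ch")
```

With this shape, the caller's choice reaches load_agent. How the values in the languages parameter would map to PaddleOCR language codes is left open here.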

MthwRobinson commented 2 weeks ago

Hi @peixin-lin - thanks for reporting. We'll take a look as soon as we're able.

@christinestraub - This would be a good one to look at once you free up.

peixin-lin commented 2 weeks ago

I found that setting the environment variable DEFAULT_PADDLE_LANG to "ch" works for now.
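
A minimal sketch of that workaround from Python, assuming the library reads DEFAULT_PADDLE_LANG when it is imported, so the variable has to be set first. The helper set_paddle_lang and the file name document.pdf are hypothetical, and the partition call is left commented out so the snippet runs without unstructured installed.

```python
import os


def set_paddle_lang(lang: str) -> str:
    """Set DEFAULT_PADDLE_LANG and return the value now in the environment."""
    os.environ["DEFAULT_PADDLE_LANG"] = lang
    # Fall back to "en" if unset; that default is an assumption here.
    return os.getenv("DEFAULT_PADDLE_LANG", "en")


# "ch" is the PaddleOCR code reported to work in this thread.
active_lang = set_paddle_lang("ch")

# Import and call unstructured only after the variable is set, e.g.:
# from unstructured.partition.pdf import partition_pdf
# elements = partition_pdf(filename="document.pdf", strategy="hi_res")
```

Setting the variable in the shell before starting the process should have the same effect.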