RapidAI / RapidOCR

Awesome OCR toolkits for multiple programming languages, based on ONNXRuntime, OpenVINO and PaddlePaddle.
https://rapidai.github.io/RapidOCRDocs
Apache License 2.0

RapidOCR Error - Leaked Semaphore Objects & OOM Killer #231

Open BennisonDevadoss opened 5 days ago

BennisonDevadoss commented 5 days ago

Problem Description:

While processing a large number of images (approximately 1000) using RapidOCR, I encountered the following errors midway through the process:

  1. Leaked Semaphore Objects: "There appear to be 1 leaked semaphore object(s) to clean up at shutdown."
  2. Process Killed by OOM Killer: "The process of this unit has been killed by the OOM killer."

System Information:

Reproducible Code:

from typing import Sequence, Union
import numpy as np

def extract_from_images_with_rapidocr(
    images: Sequence[Union[np.ndarray, bytes]],
) -> str:
    try:
        from rapidocr_onnxruntime import RapidOCR
    except ImportError:
        raise ImportError(
            "`rapidocr-onnxruntime` package not found, please install it with "
            "`pip install rapidocr-onnxruntime`"
        )
    ocr = RapidOCR()
    text = ""
    for img in images:
        result, _ = ocr(img)  # result: one [box, text, score] entry per detected line
        if result:
            lines = [item[1] for item in result]  # keep only the recognized text
            text += "\n".join(lines) + "\n"
    return text

Research & Findings:

These errors seem to be related to memory leaks during batch image processing. I am uncertain about how to resolve these issues within RapidOCR, especially when handling large numbers of images.
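For context, a common mitigation when a long batch loop leaks memory (whether in Python itself or in a native library underneath) is to recycle the worker process periodically so the OS reclaims whatever leaked. Below is a minimal stdlib-only sketch of that pattern; it is not a RapidOCR feature, and the per-item function is left generic (for OCR you would construct `RapidOCR()` inside the worker function, not in the parent):

```python
import multiprocessing as mp
from typing import Any, Callable, List, Sequence

def map_in_recycled_workers(
    fn: Callable[[Any], Any],
    items: Sequence[Any],
    tasks_per_child: int = 50,
) -> List[Any]:
    """Apply fn to every item, replacing the worker process periodically.

    maxtasksperchild retires each worker after the given number of tasks,
    so memory leaked by fn (or by a native runtime such as onnxruntime
    beneath it) is returned to the OS instead of accumulating across the
    whole 1000-image batch.
    """
    with mp.Pool(processes=1, maxtasksperchild=tasks_per_child) as pool:
        return list(pool.imap(fn, items))
```

Note that `fn` must be picklable (a top-level function), which is why the OCR engine should be created lazily inside it.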

Additional Questions:

  1. Are there any memory management techniques or best practices for handling large image batches in RapidOCR?
  2. How can I optimize memory usage to prevent OOM killer termination?
  3. Is there a way to monitor memory consumption or manage semaphore objects during the process?
  4. Would changing the version of RapidOCR (upgrading/downgrading) help resolve this memory-related issue?

Any guidance or solutions would be greatly appreciated!
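For reference on question 3, memory consumption during the loop can be watched without any extra dependencies. A stdlib-only sketch (POSIX-only; `ru_maxrss` is KiB on Linux but bytes on macOS, and Linux is assumed here):

```python
import resource  # POSIX-only; on Windows, psutil would be needed instead

def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB (Linux units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
```

Printing `peak_rss_mib()` every N images inside the loop makes it easy to see whether memory grows steadily (a leak) or spikes on specific inputs (oversized images).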

SWHL commented 5 days ago

My guess is that some of the 1000 images are very large, so the memory requested while recognizing them exceeds the limit. For now, I recommend checking the images sent for recognition for particularly large ones, such as 4000x7000, and resizing them in advance before sending them for OCR recognition.

Later, I will add logic to the code to keep memory usage within the limit.
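Until such logic lands, the resize pass can be done on the caller's side. A minimal NumPy-only sketch that caps the longest side (the 2000-pixel cap is an assumed default, not an official RapidOCR recommendation, and plain stride subsampling is used only to keep the sketch dependency-free; `cv2.resize` or Pillow give better quality in practice):

```python
import numpy as np

def shrink_if_large(img: np.ndarray, max_side: int = 2000) -> np.ndarray:
    """Downscale an image whose longest side exceeds max_side.

    Uses integer-stride subsampling so the sketch needs only NumPy;
    real code should prefer cv2.resize or PIL for quality.
    """
    h, w = img.shape[:2]
    longest = max(h, w)
    if longest <= max_side:
        return img
    step = -(-longest // max_side)  # ceiling division -> integer stride
    return img[::step, ::step]
```

Applying this to each image before the `ocr(img)` call bounds the per-image memory footprint at the cost of some detail on very large pages.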

BennisonDevadoss commented 4 days ago

@SWHL, Thank you for the response! I have a couple of follow-up questions based on your suggestions:

  1. What would be the recommended target resolution for images to prevent memory overload during OCR processing? Is there an optimal balance between image size and OCR accuracy?
  2. Could you share more details about the memory control logic you plan to add? Will this logic automatically resize or manage large images, and will it be included in a future release of RapidOCR?

SWHL commented 4 days ago

Both points are already under development; please refer to the develop branch. They will be included in the next release soon.

SWHL commented 3 days ago

You can try it again with `rapidocr_onnxruntime==1.3.25`.