Out of memory when running OCR for a lot of images

tinganle commented 4 years ago

java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (665M) > maxPhysicalBytes (600M)
    at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:589)
    at org.bytedeco.javacpp.Pointer.init(Pointer.java:125)
    at org.bytedeco.tesseract.TessBaseAPI.allocate(Native Method)
    at org.bytedeco.tesseract.TessBaseAPI.<init>(TessBaseAPI.java:35)

Increasing the heap size only delays the error; it still occurs eventually. We follow the basic example: for each 'document', which consists of multiple image files, we create a new instance of BytedecoOcrAPI and call init(), then call doOcr() once per image file.

public class BytedecoOcrAPI implements OcrAPI {

    private TessBaseAPI api;
    private String dataPath;

    public BytedecoOcrAPI(String dataPath) {
        this.api = new TessBaseAPI();
        this.dataPath = dataPath;
    }

    public void init() throws OcrException {
        if (api.Init(dataPath, "eng", 0) != 0) {
            throw new OcrException("failed to read tesseract data file from ");
        }
        api.SetVariable("tessedit_char_blacklist", "ﬁﬂﬀﬃﬄﬅ");
        api.SetVariable("hocr_font_info", "0");
        api.SetPageSegMode(1);
    }

    public String doOcr(File file) throws OcrException {
        return doOcr(file.getPath());
    }

    public String doOcr(String filePathName) throws OcrException {
        PIX image = pixRead(filePathName);
        api.SetImage(image);

        BytePointer output = api.GetHOCRText(0);

        try {
            return output.getString("UTF-8");
        } catch (IOException e) {
            throw new OcrException(e);
        } finally {
            output.deallocate();
            pixDestroy(image);
        }
    }

    public void close() {
        api.End();
    }

}

saudet commented 4 years ago

600 MB isn't a lot of memory. You'll probably need to increase that.

saudet commented 4 years ago

Just to be sure, add a call to api.deallocate() right after api.End(). Let me know if that doesn't fix anything though.
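
The suggestion above amounts to a two-step release: End() shuts down the engine, and deallocate() then frees the native object behind the Java wrapper. A minimal sketch of that lifecycle follows. To stay runnable without the native library it uses a hypothetical NativeEngine interface as a stand-in for TessBaseAPI; NativeEngine, OcrSession, and RecordingEngine are illustrative names, not part of the presets.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrSessionDemo {
    /** Hypothetical stand-in for the two release calls discussed in the thread. */
    interface NativeEngine {
        void end();         // mirrors TessBaseAPI.End()
        void deallocate();  // mirrors Pointer.deallocate()
    }

    /** Wraps an engine so that try-with-resources releases it exactly once. */
    static final class OcrSession implements AutoCloseable {
        private final NativeEngine engine;
        private boolean closed;

        OcrSession(NativeEngine engine) {
            this.engine = engine;
        }

        @Override
        public void close() {
            if (closed) {
                return; // a second close() is a no-op
            }
            closed = true;
            try {
                engine.end();
            } finally {
                engine.deallocate(); // runs even if end() throws
            }
        }
    }

    /** Stub engine that records which release calls were made, and in what order. */
    static final class RecordingEngine implements NativeEngine {
        final List<String> calls = new ArrayList<>();

        @Override
        public void end() {
            calls.add("end");
        }

        @Override
        public void deallocate() {
            calls.add("deallocate");
        }
    }

    /** Processes one "document" and returns the release calls that were made. */
    public static List<String> runOneDocument() {
        RecordingEngine engine = new RecordingEngine();
        try (OcrSession session = new OcrSession(engine)) {
            // OCR each image file of the document here.
        }
        return engine.calls;
    }

    public static void main(String[] args) {
        System.out.println(runOneDocument());
    }
}
```

In the class from the original post, the equivalent change would be to have close() call api.End() followed by api.deallocate(), and to make sure close() runs once per document.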

tinganle commented 4 years ago

> 600 MB isn't a lot of memory. You'll probably need to increase that.

Yes. The 600 MB limit came from capping the heap at 300M on my local machine to reproduce the problem faster. When we ran the same process on Linux with more memory, it threw the same error at around 8G.

Also, when we monitored the Java heap size and the JVM process memory size, there was a huge difference. On my local machine, with the max heap size set to 300M, the Java heap stayed below 250M, but the JVM process could use 2G of memory. We do parallel OCR processing, and each thread OCRs only one image file at a time. If memory were cleaned up properly, usage shouldn't keep growing.
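
One way to observe the gap described above from inside the JVM is to log heap usage next to the resident set size of the process. The sketch below assumes Linux, where /proc/self/status exposes a VmRSS line, and returns -1 on other platforms. MemoryGap and its method names are illustrative. JavaCPP also has its own counters (the physicalBytes and maxPhysicalBytes values in the error message), which this sketch does not use.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class MemoryGap {
    /** Bytes of Java heap currently in use. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Parses the VmRSS line from /proc/<pid>/status content; returns bytes, or -1 if absent. */
    public static long parseVmRssBytes(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Expected shape: "VmRSS:", "<number>", "kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]) * 1024L;
            }
        }
        return -1L;
    }

    /** Resident set size of this process, or -1 where /proc is unavailable. */
    public static long residentBytes() {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            return -1L;
        }
        try {
            return parseVmRssBytes(Files.readAllLines(status));
        } catch (IOException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("heap used: " + usedHeapBytes() / mb + " MB");
        long rss = residentBytes();
        System.out.println(rss < 0
                ? "process RSS: unavailable"
                : "process RSS: " + rss / mb + " MB");
    }
}
```

Logging both values after each document should show whether the process size keeps climbing while the heap stays flat, which would point to native allocations that are never released.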

I will try api.deallocate(). Any other ideas are much appreciated. Thanks!

tinganle commented 4 years ago

Hi @saudet,

When I debug, both output and image have a null deallocator at the following statements. Calling output.deallocate() doesn't seem to do anything. Is this the expected behavior?

    output.deallocate();
    pixDestroy(image);

Thanks.

saudet commented 4 years ago

Yes, those are just pointers returned from native functions, so JavaCPP doesn't know how to deallocate them.

tinganle commented 4 years ago

Update: calling TessDeleteText(output) after each OCR greatly reduced the memory growth, although the issue is not fully resolved yet.
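
The general pattern behind this update is that a buffer returned by a native function has to be released with the free function that matches its allocator, and in a finally block so that it also runs on failure. A minimal sketch of that pattern follows, written generically so that it runs without the native library. readThenFree and NativeResults are illustrative names, and the commented binding to the presets is an untested assumption based on the calls mentioned in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class NativeResults {
    /**
     * Reads a value from a natively allocated handle, then always releases the
     * handle with the free function that matches its allocator.
     */
    public static <H, R> R readThenFree(H handle, Function<H, R> reader, Consumer<H> free) {
        try {
            return reader.apply(handle);
        } finally {
            free.accept(handle); // runs even if the reader throws
        }
    }

    // With the presets on the classpath, the call inside doOcr() might look like
    // this (assumption, not verified here):
    //   String hocr = readThenFree(api.GetHOCRText(0),
    //                              p -> p.getString(),
    //                              p -> TessDeleteText(p));

    public static void main(String[] args) {
        // Plain strings stand in for native handles in this demo.
        List<String> freed = new ArrayList<>();
        String text = readThenFree("handle-1", h -> "text from " + h, freed::add);
        System.out.println(text + ", freed=" + freed);
    }
}
```

The image returned by pixRead() would still need its own pixDestroy(image) call, as in the original doOcr().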

bytedeco / javacpp-presets

Out of memory when running OCR for a lot of images #836