OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces

Memory leaks #105

Closed · stweil closed this issue 4 years ago

stweil commented 4 years ago

The memory usage of ocrd-tesserocr-segment-region increases with each page, reaching about 7 GB after 200 pages, 8 GB after 248 pages, 10 GB after 282 pages, and 11 GB after 313 pages (observed for http://nbn-resolving.de/urn:nbn:de:bsz:180-digad-22977).

ocrd-tesserocr-segment-line shows a similar effect.

For that book, a machine with 8 GB RAM would start swapping, slowing processing down drastically. Even a large server would run into memory problems when processing large books of more than 1000 pages in parallel.

stweil commented 4 years ago

ocrd-tesserocr-recognize also seems to leak memory with each page.

bertsky commented 4 years ago

Interesting! Have you seen anything like this on Tesseract CLI (e.g. when using file lists)?

We don't End() (i.e. deinitialize) the API between pages – which we could, but that would add the cost of loading models into memory each time, which clearly should not be necessary. (If it is, there's likely a bug in Tesseract.)

Can you try adding tessapi.Clear() between pages? (But AFAIK tesserocr already does pixDestroy() on the image data between successive SetImage() calls and free() on the text results between successive GetUTF8Text() calls. Also, it has __dealloc__() on all iterator classes, which del the iterator, as they should according to Tesseract's API documentation. But who knows, maybe in some corner of tesserocr this was forgotten.)
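
For illustration, a minimal sketch of such a per-page Clear() (language, model and file names assumed, not taken from the processor code):

# Minimal sketch: keep the models loaded, but drop per-page data with Clear().
from tesserocr import PyTessBaseAPI

image_paths = ['page0001.png', 'page0002.png']  # assumed sample inputs

with PyTessBaseAPI(lang='deu') as tessapi:  # models are loaded only once here
    for image_path in image_paths:
        tessapi.SetImageFile(image_path)
        text = tessapi.GetUTF8Text()  # recognize the page
        tessapi.Clear()  # free stored image data and recognition results, keep the models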

stweil commented 4 years ago

We had memory leaks for the Tesseract command line tools in the past, but fixed them. Tests with Valgrind and clang sanitizers show no remaining leaks for all test cases.

I already tried to find the leak in ocrd-tesserocr-recognize using Valgrind, but got lots of problem reports for Python code which seem to hide the problem I was searching for.

bertsky commented 4 years ago

We had memory leaks for the Tesseract command line tools in the past, but fixed them. Tests with Valgrind and clang sanitizers show no remaining leaks for all test cases.

And do these tests include (very) long input file lists?

Please try your setup with Clear() between pages!

stweil commented 4 years ago

And do these tests include (very) long input file lists?

They would find even small leaks of a few bytes, so unless there is a special mechanism for long input file lists, it does not make a difference whether a test iterates over two or 2000 files.

bertsky commented 4 years ago

They would find even small leaks of a few bytes, so unless there is a special mechanism for long input file lists, it does not make a difference whether a test iterates over two or 2000 files.

True, the number of files is irrelevant. But having a loop or no loop can make a difference – that's why I was asking for the "input file list" mode. Anyway, I just did a simple test myself and cannot find an increase in memory (using memtime).

So we are looking for something in tesserocr...

bertsky commented 4 years ago

I can report that tessapi.Clear() between pages is not helpful as a workaround.

@stweil Can you please isolate this into a minimal example (without OCR-D dependencies) and report on sirfz/tesserocr?
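
Such a minimal example might look roughly like this (file name and page count assumed), watching the resident set size while it runs:

# Hypothetical minimal reproduction without OCR-D dependencies:
# feed the same PIL image to SetImage() many times and watch memory usage.
from PIL import Image
from tesserocr import PyTessBaseAPI

with PyTessBaseAPI() as tessapi:
    for n in range(1000):                   # simulate a 1000-page book
        image = Image.open('page0001.png')  # assumed sample scan
        tessapi.SetImage(image)
        tessapi.AnalyseLayout()             # layout analysis only, as in segment-region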

stweil commented 4 years ago

There is now a first usable result from Valgrind, obtained by running ocrd-tesserocr-segment-line on a small set of five pages:

==261362== 672 bytes in 40 blocks are definitely lost in loss record 730 of 1,097
==261362==    at 0x483650F: operator new[](unsigned long) (vg_replace_malloc.c:433)
==261362==    by 0x57CBC10: tesseract::TessBaseAPI::GetComponentImages(tesseract::PageIteratorLevel, bool, bool, int, Pixa**, int**, int**) (baseapi.cpp:736)
==261362==    by 0x57A1B5F: __pyx_pf_9tesserocr_13PyTessBaseAPI_70GetComponentImages (tesserocr.cpp:20199)
==261362==    by 0x57A1B5F: __pyx_pw_9tesserocr_13PyTessBaseAPI_71GetComponentImages(_object*, _object*, _object*) (tesserocr.cpp:20084)
==261362==    by 0x5C9B94: _PyMethodDef_RawFastCallKeywords (in /usr/bin/python3.7)

I'll investigate that further.

stweil commented 4 years ago

The implementation in tesserocr is buggy here for the last two arguments: it passes the addresses of _blockids and _paraids, so those arguments are never NULL, and memory will be allocated.

I'll test this and create a pull request for tesserocr, but won't open an extra issue there.

stweil commented 4 years ago

Here are the first test results with a fixed GetComponentImages.

Before fix:

==261362== LEAK SUMMARY:
==261362==    definitely lost: 672 bytes in 40 blocks
==261362==    indirectly lost: 0 bytes in 0 blocks
==261362==      possibly lost: 142,899 bytes in 80 blocks

After fix:

==262539== LEAK SUMMARY:
==262539==    definitely lost: 0 bytes in 0 blocks
==262539==    indirectly lost: 0 bytes in 0 blocks
==262539==      possibly lost: 142,899 bytes in 80 blocks

bertsky commented 4 years ago

The implementation in tesserocr is buggy here for the last two arguments: it passes the addresses of _blockids and _paraids, so those arguments are never NULL, and memory will be allocated.

Looks correct to me, though. They are deallocated afterwards (although with free, not with delete[] as documented in Tesseract's baseapi.h). Or are you referring to the boxa == NULL case?

stweil commented 4 years ago

They are only deallocated in the True case, but not in the more common False case.

The bad news is that there seem to be more leaks.

bertsky commented 4 years ago

but not in the more common False case.

But in the False case, NULL is passed, so they don't get alloc'd in libtesseract anyway. (Also, this is the less common case, being non-default.)

Have you tried replacing free with delete[]?

stweil commented 4 years ago

No, the address of the local null pointers is passed, and the address of a local variable is never NULL. That's the bug I mentioned.

bertsky commented 4 years ago

No, the address of the local null pointers is passed, and the address of a local variable is never NULL. That's the bug I mentioned.

Oh, of course! It's so easy to miss. The & is wrong.
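
To make the distinction concrete, a small ctypes analogy (an illustration only, not the tesserocr code):

# A NULL int* versus the address of the (local) pointer variable holding it.
import ctypes

blockids = ctypes.POINTER(ctypes.c_int)()  # a NULL int*, like tesserocr's _blockids
print(bool(blockids))                      # False: the pointer itself is NULL
print(bool(ctypes.byref(blockids)))        # True: its address is never NULL, so a
                                           # callee testing that argument will allocate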

stweil commented 4 years ago

There definitely remains a major leak: memory usage still grows from less than 1 GB to more than 10 GB over about 300 pages.

stweil commented 4 years ago

I can now reproduce the leak when the loop over the images runs only up to tessapi.SetImage(page_image). I see no leak if the loop skips to the next iteration just before that statement.

There is no leak when a similar loop is run in tesserocr.

Maybe the page_image returned by self.workspace.image_from_page is never released once it has been used at least once?

bertsky commented 4 years ago

I can now reproduce the leak when the loop over the images runs only up to tessapi.SetImage(page_image)

What processor are you referring to? (Your previous comments related to segment_line, which uses region images.)

bertsky commented 4 years ago

Maybe the page_image returned by self.workspace.image_from_page is never released once it has been used at least once?

What do you mean by released? The PIL.Image object should go out of scope between pages/iterations, so the runtime GC should be able to pick it up. I don't see anyone holding a reference to it which does not in turn go out of scope.

stweil commented 4 years ago

What processor are you referring to?

The latest tests were made with ocrd-tesserocr-segment-region and a small code modification:

--- ocrd_tesserocr/segment_region.py    2020-01-09 16:34:39.398664316 +0100
+++ venv/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py   2020-01-12 16:42:16.188062969 +0100
@@ -132,7 +132,13 @@
                     tessapi.SetVariable('user_defined_dpi', str(dpi))

                 LOG.info("Detecting regions in page '%s'", page_id)
+
                 tessapi.SetImage(page_image) # is already cropped to Border
+                ###
+                from time import sleep
+                sleep(0.1)
+                continue
+
                 tessapi.SetPageSegMode(PSM.AUTO) # (default)

                 # detect the region segments and types:

I added a sleep statement to slow down the loop, which makes it easier to watch the memory changes over time. It is not strictly needed here because tessapi.SetImage is already slow enough.

stweil commented 4 years ago

The PIL.Image object should go out of scope between pages/iterations [...]

Yes, it should, but something (maybe not the image) does not go out of scope after tessapi.SetImage(page_image). I already added test code which dumps the Python garbage objects at program exit, but that test code found nothing.
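
Such a dump can be done along these lines (a sketch; the exact test code is not shown in this thread):

# Sketch: make the collector keep unreachable objects and dump them at exit.
import atexit
import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # collected garbage is kept in gc.garbage

def dump_garbage():
    gc.collect()
    for obj in gc.garbage:
        print(type(obj).__name__, repr(obj)[:80])

atexit.register(dump_garbage)

Note that this only surfaces objects the collector considered unreachable; anything still referenced from a live container will never show up there.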

bertsky commented 4 years ago

I see. Interesting!

bertsky commented 4 years ago

So can you perhaps estimate the growth rate per page? That would give us an idea of what kind of object (image, numpy array, string/list/...) we are looking for. A comparison of the various processors could also be relevant (they take page, page+region, or page+region+line images).

as memory usage increases from less than 1 GB to more than 10 GB for about 300 pages.

This was for the line segmentation?
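
As an aside, the figures from the opening comment already allow a rough back-of-envelope estimate:

# Back-of-envelope growth rate from the figures reported at the top:
pages = [200, 248, 282, 313]
gigabytes = [7, 8, 10, 11]
rate = (gigabytes[-1] - gigabytes[0]) / (pages[-1] - pages[0])
print('~%.0f MB per page' % (rate * 1024))  # roughly 36 MB per page

Roughly 36 MB per page is on the order of one uncompressed page scan, which would be consistent with whole images being retained.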

stweil commented 4 years ago

This loop is sufficient to show the leak:

for (n, input_file) in enumerate(self.input_files):
    page_id = input_file.pageId or input_file.ID
    LOG.info("INPUT FILE %i / %s", n, page_id)
    pcgts = page_from_file(self.workspace.download_file(input_file))
    page = pcgts.get_Page()

    ###
    # get the (possibly cached) page image, hand it to Tesseract,
    # then skip all further processing:
    page_image, page_coords, page_image_info = self.workspace.image_from_page(
        page, page_id)
    tessapi.SetImage(page_image)
    continue

Therefore I think it does not depend on the processor.

I also noted that the loop is really slow and contributes a very significant part of the total processing time, probably for all processors. Just reading the image files is much faster.

bertsky commented 4 years ago

Therefore I think it does not depend on the processor.

Yes – thanks for narrowing it down to that.

I guess that if – as you said earlier – the leak still crucially depends on SetImage being there, then this means the GC does have reason to think tessapi still holds a reference to the image somewhere.

What if you replace SetImage(page_image) with SetImageFile(page.imageFilename)?

What if you replace SetImage(page_image) with this:

bytes_per_pixel = { 'RGBA': 4, 'RGB': 3, 'LA': 2, 'L': 1, '1': 1}.get(page_image.mode)
tessapi.SetImageBytes(page_image.tobytes(), page_image.width, page_image.height, bytes_per_pixel, bytes_per_pixel * page_image.width)

I also noted that the loop is really slow and contributes a very significant part of the total processing time, probably for all processors. Just reading the image files is much faster.

What do you mean by just reading the images? The raw page.imageFilename files perhaps? What kind of workspace/annotation are you running on? (Does it have cropping/binarization/deskewing etc.?)

image_from_page must look for AlternativeImage and Border, sometimes it must crop/rotate and mask a new image on the fly, depending on the existing annotation.
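
Roughly, the derivation described here looks like this (a simplified sketch using core's helpers; the actual implementation also handles rotation and masking):

# Simplified sketch of what image_from_page has to do per page:
from ocrd_utils import bbox_from_points

def image_from_page(workspace, page, page_id):
    alternatives = page.get_AlternativeImage()
    if alternatives:
        # use the last derived image (e.g. binarized/deskewed) if annotated
        image = workspace.resolve_image_as_pil(alternatives[-1].get_filename())
    else:
        image = workspace.resolve_image_as_pil(page.imageFilename)
        border = page.get_Border()
        if border:
            # crop to the annotated Border polygon on the fly
            image = image.crop(bbox_from_points(border.get_Coords().points))
    return image

resolve_image_as_pil is also the place where the image cache mentioned below comes into play.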

stweil commented 4 years ago

I just found that the OCR-D Workspace class has an image_cache which is filled with each new image. There is no method to clear that cache (which should be done after each page has been processed). CC @kba.

stweil commented 4 years ago

Bingo. Clearing the cache fixes the leak!

bertsky commented 4 years ago

I just found that the OCR-D Workspace class has an image_cache which is filled with each new image. There is no method to clear that cache (which should be done after each page has been processed). CC @kba.

Blows my mind. Spirits I have conjured, no longer pay me heed.

stweil commented 4 years ago

This code was used to clear the cache:

--- ocrd_tesserocr/ocrd_tesserocr/segment_region.py 2020-01-09 16:34:39.398664316 +0100
+++ venv/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py   2020-01-12 18:56:35.812767316 +0100
@@ -138,7 +138,15 @@
                 # detect the region segments and types:
                 layout = tessapi.AnalyseLayout()
                 self._process_page(layout, page, page_image, page_coords, input_file.pageId)
-                
+
+                ###
+                # Clear the image cache after finishing the current page.   
+                self.workspace.image_cache = {
+                    'pil': {},
+                    'cv2': {},
+                    'exif': {},
+                }
+
                 # Use input_file's basename for the new file -
                 # this way the files retain the same basenames:
                 file_id = input_file.ID.replace(self.input_file_grp, self.output_file_grp)

@kba, I suggest adding a method for this to the Workspace class, maybe not clear_image_cache but finished_page (better suggestions welcome). And I'm afraid that many OCR-D processors will need an update to use that new method.

bertsky commented 4 years ago

I suggest adding a method for this to the Workspace class, maybe not clear_image_cache but finished_page (better suggestions welcome). And I'm afraid that many OCR-D processors will need an update to use that new method.

Or even better: deactivate in core by default, and add cache init/deinit methods for processors that would actually benefit from image caching (probably not many).
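
A sketch of what such opt-in cache handling on Workspace might look like (method names are suggestions, not actual core API):

# Hypothetical opt-in caching API for the Workspace class (names assumed):
class Workspace:
    def init_image_cache(self):
        """Opt in to caching of derived page images."""
        self.image_cache = {'pil': {}, 'cv2': {}, 'exif': {}}

    def deinit_image_cache(self):
        """Drop all cached images, e.g. after a page has been processed."""
        self.image_cache = {'pil': {}, 'cv2': {}, 'exif': {}}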

kba commented 4 years ago

https://github.com/OCR-D/core/issues/377