`ocrd-tesserocr-recognize` also seems to leak memory with each page.
Interesting! Have you seen anything like this on Tesseract CLI (e.g. when using file lists)?
We don't `End()` (i.e. deinitialize) the API between pages – which we could, but that would add the cost of loading models into memory each time, which clearly should not be necessary. (If it is, there's likely a bug in Tesseract.)
Can you try adding `tessapi.Clear()` between pages? (But AFAIK tesserocr already does `pixDestroy()` on the image data between successive `SetImage()` and `free()` on the text results between successive `GetUTF8Text()`. Also, it has `__dealloc__()` on all iterator classes, which `del` the iterator, as they should according to Tesseract's API documentation. But who knows, maybe in some corner of tesserocr this was forgotten.)
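To be explicit, this is the kind of loop I mean (a minimal sketch with placeholder data, not the actual processor code):

```python
from tesserocr import PyTessBaseAPI

with PyTessBaseAPI() as tessapi:       # models are loaded once here
    for page_image in page_images:     # placeholder: PIL images of the pages
        tessapi.SetImage(page_image)
        text = tessapi.GetUTF8Text()
        tessapi.Clear()                # drop per-page data, keep the loaded models
```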
We had memory leaks for the Tesseract command line tools in the past, but fixed them. Tests with Valgrind and clang sanitizers show no remaining leaks for all test cases.
I already tried to find the leak in `ocrd-tesserocr-recognize` using Valgrind, but I get lots of problem reports for Python code which seem to hide the problem I was searching for.
> We had memory leaks for the Tesseract command line tools in the past, but fixed them. Tests with Valgrind and clang sanitizers show no remaining leaks for all test cases.
And do these tests include (very) long input file lists?
Please try your setup with `Clear()` between pages!
> And do these tests include (very) long input file lists?
They would find even small leaks of a few bytes, so unless there is a special mechanism for long input file lists, it does not make a difference whether a test iterates over two or 2000 files.
> They would find even small leaks of a few bytes, so unless there is a special mechanism for long input file lists, it does not make a difference whether a test iterates over two or 2000 files.
True, the number of files is irrelevant. But having a loop or no loop can make a difference – that's why I was asking for the "input file list" mode. Anyway, I just did a simple test myself and cannot find an increase in memory (using `memtime`).
So we are looking for something in tesserocr...
I can report that `tessapi.Clear()` between pages is not helpful as a workaround.
@stweil Can you please isolate this into a minimal example (without OCR-D dependencies) and report on sirfz/tesserocr?
There is now a first usable result from Valgrind from running `ocrd-tesserocr-segment-line` on a small set of five pages:
```
==261362== 672 bytes in 40 blocks are definitely lost in loss record 730 of 1,097
==261362==    at 0x483650F: operator new[](unsigned long) (vg_replace_malloc.c:433)
==261362==    by 0x57CBC10: tesseract::TessBaseAPI::GetComponentImages(tesseract::PageIteratorLevel, bool, bool, int, Pixa**, int**, int**) (baseapi.cpp:736)
==261362==    by 0x57A1B5F: __pyx_pf_9tesserocr_13PyTessBaseAPI_70GetComponentImages (tesserocr.cpp:20199)
==261362==    by 0x57A1B5F: __pyx_pw_9tesserocr_13PyTessBaseAPI_71GetComponentImages(_object*, _object*, _object*) (tesserocr.cpp:20084)
==261362==    by 0x5C9B94: _PyMethodDef_RawFastCallKeywords (in /usr/bin/python3.7)
```
I'll investigate that further.
The implementation in tesserocr is buggy here for the last two arguments: it passes the addresses of `_blockids` and `_paraids`, so those arguments are never NULL, and memory will be allocated. I'll test this and create a pull request for tesserocr, but not write an extra issue there.
Here are the first test results with a fixed `GetComponentImages`.
Before fix:
```
==261362== LEAK SUMMARY:
==261362==    definitely lost: 672 bytes in 40 blocks
==261362==    indirectly lost: 0 bytes in 0 blocks
==261362==      possibly lost: 142,899 bytes in 80 blocks
```
After fix:
```
==262539== LEAK SUMMARY:
==262539==    definitely lost: 0 bytes in 0 blocks
==262539==    indirectly lost: 0 bytes in 0 blocks
==262539==      possibly lost: 142,899 bytes in 80 blocks
```
> The implementation in tesserocr is buggy here for the last two arguments: it passes the addresses of `_blockids` and `_paraids`, so those arguments are never NULL, and memory will be allocated.
Looks correct to me, though. They are deallocated afterwards (although not via `delete[]`, as documented in Tesseract's baseapi.h, but `free`). Or are you referring to the `boxa == NULL` case?
They are only deallocated for the `True` case, but not for the more common `False` case. The bad news is that there seem to be more leaks.
> but not for the more common `False` case.
But in the `False` case, `NULL` is passed, so they don't get alloc'd in libtesseract anyway. (Also, this is the less common case, being non-default.) Have you tried replacing `free` with `delete[]`?
No, the address of null pointers is passed, and the address of a local variable is never `NULL`. That's the bug I mentioned.
> No, the address of null pointers is passed, and the address of a local variable is never `NULL`. That's the bug I mentioned.
Oh, of course! It's so easy to miss. The `&` is wrong.
There definitely remains a major leak, as memory usage increases from less than 1 GB to more than 10 GB for about 300 pages.
I can now reproduce the leak when the loop over the images only runs until `tessapi.SetImage(page_image)`. I see no leak if the loop skips to the next iteration (`continue`) before that statement. There is no leak when a similar loop is run in `tesserocr`.

Maybe the `page_image` returned by `self.workspace.image_from_page` is never released once it has been used at least once?
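Such a tesserocr-only comparison loop could look roughly like this (a sketch, not the exact test code; `image_files` stands in for the page image paths):

```python
from PIL import Image
from tesserocr import PyTessBaseAPI

with PyTessBaseAPI() as tessapi:
    for path in image_files:           # placeholder list of page image paths
        page_image = Image.open(path)
        tessapi.SetImage(page_image)   # no memory growth observed in this variant
```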
> I can now reproduce the leak when the loop over the images only runs until `tessapi.SetImage(page_image)`

What processor are you referring to? (Your previous comments related to `segment_line`, which uses region images.)
> Maybe the `page_image` returned by `self.workspace.image_from_page` is never released once it has been used at least once?
What do you mean by released? The PIL.Image object should go out of scope between pages/iterations, so the runtime GC should be able to pick it up. I don't see anyone holding a reference to it which does not in turn go out of scope.
> What processor are you referring to?
The latest tests were made with `ocrd-tesserocr-segment-region` and a small code modification:
```diff
--- ocrd_tesserocr/segment_region.py	2020-01-09 16:34:39.398664316 +0100
+++ venv/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py	2020-01-12 16:42:16.188062969 +0100
@@ -132,7 +132,13 @@
                 tessapi.SetVariable('user_defined_dpi', str(dpi))
             LOG.info("Detecting regions in page '%s'", page_id)
+
             tessapi.SetImage(page_image) # is already cropped to Border
+            ###
+            from time import sleep
+            sleep(0.1)
+            continue
+
             tessapi.SetPageSegMode(PSM.AUTO) # (default)
             # detect the region segments and types:
```
I added a sleep statement to slow down the loop, which makes it easier to watch the memory changes over time. It is not needed here because `tessapi.SetImage` is already slow enough.
> The PIL.Image object should go out of scope between pages/iterations [...]
Yes, it should, but something (maybe not the image) does not go out of scope after `tessapi.SetImage(page_image)`. I already added test code which dumps the Python garbage objects at program exit, but that test code found nothing.
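(Such a dump can be done roughly like this – a sketch, not the exact test code used here – keeping everything the collector could not free and printing it at interpreter exit:)

```python
import atexit
import gc

def dump_garbage():
    gc.set_debug(gc.DEBUG_SAVEALL)     # keep unreachable objects in gc.garbage
    gc.collect()
    for obj in gc.garbage:
        print(type(obj), repr(obj)[:80])

atexit.register(dump_garbage)
```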
I see. Interesting!
So can you perhaps estimate the growth rate per page? That would give us an idea what kind of object (image, numpy-array, string/list/...) we are looking for. Also a comparison of the various processors could be relevant (which take page or page+region or page+region+line images).
> as memory usage increases from less than 1 GB to more than 10 GB for about 300 pages.
This was for the line segmentation?
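One way to estimate that per-page growth (a sketch, assuming `psutil` is available; not code from this thread) is to log the resident set size in every iteration:

```python
import os
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    """Current resident set size of this process in MiB."""
    return proc.memory_info().rss / 1024 ** 2

# inside the page loop, e.g.:
#     before = rss_mb()
#     ...process one page...
#     LOG.info("page %s grew RSS by %.1f MiB", page_id, rss_mb() - before)
```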
This loop is sufficient to show the leak:
```python
for (n, input_file) in enumerate(self.input_files):
    page_id = input_file.pageId or input_file.ID
    LOG.info("INPUT FILE %i / %s", n, page_id)
    pcgts = page_from_file(self.workspace.download_file(input_file))
    page = pcgts.get_Page()
    ###
    page_image, page_coords, page_image_info = self.workspace.image_from_page(
        page, page_id)
    tessapi.SetImage(page_image)
    continue
```
Therefore I think it does not depend on the processor.
I also noted that the loop is really slow and contributes a very significant part to the total processing time, probably for all processors. Just reading the images is much faster.
> Therefore I think it does not depend on the processor.
Yes – thanks for narrowing it down to that.
I guess that if – as you said earlier – the leak still crucially depends on `SetImage` being there, then this means the GC does have reason to think `tessapi` still holds a reference to the image somewhere.
What if you replace `SetImage(page_image)` with `SetImageFile(page.imageFilename)`?
Or what if you replace `SetImage(page_image)` with this:

```python
bytes_per_pixel = {'RGBA': 4, 'RGB': 3, 'LA': 2, 'L': 1, '1': 1}.get(page_image.mode)
tessapi.SetImageBytes(page_image.tobytes(), page_image.width, page_image.height,
                      bytes_per_pixel, bytes_per_pixel * page_image.width)
```
> I also noted that the loop is really slow and contributes a very significant part to the total processing time, probably for all processors. Just reading the images is much faster.
What do you mean by just reading the images? The raw `page.imageFilename` files perhaps? What kind of workspace/annotation are you running on? (Does it have cropping/binarization/deskewing etc.?) `image_from_page` must look for AlternativeImage and Border; sometimes it must crop/rotate and mask a new image on the fly, depending on the existing annotation.
I just found that the OCR-D Workspace class has an `image_cache` which is filled with each new image. There is no method to clear that cache (which should be done after processing of a page). CC @kba.

Bingo. Clearing the cache fixes the leak!
> I just found that the OCR-D Workspace class has an `image_cache` which is filled with each new image. There is no method to clear that cache (which should be done after processing of a page). CC @kba.
Blows my mind. Spirits I have conjured, no longer pay me heed.
This code was used to clear the cache:
```diff
--- ocrd_tesserocr/ocrd_tesserocr/segment_region.py	2020-01-09 16:34:39.398664316 +0100
+++ venv/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py	2020-01-12 18:56:35.812767316 +0100
@@ -138,7 +138,15 @@
             # detect the region segments and types:
             layout = tessapi.AnalyseLayout()
             self._process_page(layout, page, page_image, page_coords, input_file.pageId)
-
+
+            ###
+            # Clear the image cache after finishing the current page.
+            self.workspace.image_cache = {
+                'pil': {},
+                'cv2': {},
+                'exif': {},
+            }
+
             # Use input_file's basename for the new file -
             # this way the files retain the same basenames:
             file_id = input_file.ID.replace(self.input_file_grp, self.output_file_grp)
```
@kba, I suggest adding a method for this to class `Workspace`, maybe not `clear_image_cache`, but `finished_page` (better suggestions welcome). And I'm afraid that many OCR-D processors will need an update to use that new method.
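A possible shape for that (a hypothetical sketch only, not the actual OCR-D core API; shown as a free function for illustration):

```python
def finished_page(workspace):
    """Hypothetical helper: drop all cached images once a page has been processed.

    The same body could become a method on ocrd.Workspace, as suggested above.
    """
    workspace.image_cache = {
        'pil': {},
        'cv2': {},
        'exif': {},
    }
```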
> I suggest adding a method for this to class `Workspace`, maybe not `clear_image_cache`, but `finished_page` (better suggestions welcome). And I'm afraid that many OCR-D processors will need an update to use that new method.
Or even better: deactivate the cache in core by default, and add cache init/deinit methods for processors that would actually benefit from image caching (probably not many).
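For illustration, such opt-in caching could look roughly like this (hypothetical API sketch; `enable_image_cache`/`disable_image_cache` do not exist in core):

```python
from contextlib import contextmanager

@contextmanager
def image_caching(workspace):
    """Enable the image cache only for processors that actually benefit from it."""
    workspace.enable_image_cache()       # hypothetical init
    try:
        yield workspace
    finally:
        workspace.disable_image_cache()  # hypothetical deinit, frees all cached images
```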
The memory usage of `ocrd-tesserocr-segment-region` increases for each page, resulting in a total of about 7 GB for 200 pages, 8 GB for 248 pages, 10 GB for 282 pages and 11 GB for 313 pages (observed for http://nbn-resolving.de/urn:nbn:de:bsz:180-digad-22977). `ocrd-tesserocr-segment-line` shows a similar effect.

For that book, a machine with 8 GB RAM would have started swapping, thus slowing down the process extremely. Even a large server would get memory problems when processing large books with more than 1000 pages in parallel.