Closed. krvoigt closed this issue 2 years ago.
Create a discussion to discuss with the community
Here is additional info regarding the performance. The already available benchmark code helped me to write my own benchmark.

Benchmarking `OcrdMets.find_files()` with `benchmark_find_files(number_of_pages)` in the already available benchmarking code carries the additional overhead of building METS files with the generator `generated_mets(number_of_pages)` when searching for a file inside 5, 10, 20, 50, 100 pages. Hence, the run time of the `find_files()` function is obscured (building the METS file takes far more time than the search). What I would expect here is to generate the METS files beforehand, or to keep METS files of different sizes (small, medium, large) in the same directory or download them from a repo, and then measure the run time of `find_files()` separately. That's how I implemented my benchmark: I evaluated building METS files and searching for files separately, both with the optimized and the non-optimized versions of the functions.
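A minimal sketch of that separation, assuming `ocrd_models.OcrdMets` from core (the page counts follow the existing benchmark; `build_mets` below is a simplified stand-in for the real `generated_mets` generator, not the actual benchmark code):

```python
import time

from ocrd_models import OcrdMets

PAGE_COUNTS = [5, 10, 20, 50, 100]

def build_mets(number_of_pages):
    # simplified stand-in for the benchmark's METS generator
    mets = OcrdMets.empty_mets()
    for n in range(number_of_pages):
        mets.add_file('OCR-D-IMG', ID='FILE_%04d' % n,
                      mimetype='image/tiff', pageId='PHYS_%04d' % n)
    return mets

# build all METS files up front, so their construction cost
# does not leak into the find_files() timings below
prebuilt = {n: build_mets(n) for n in PAGE_COUNTS}

for n, mets in prebuilt.items():
    start = time.perf_counter()
    list(mets.find_files(fileGrp='OCR-D-IMG'))  # force full evaluation
    elapsed = time.perf_counter() - start
    print(f'{n} pages: find_files() took {elapsed:.6f}s')
```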
@MehmedGIT does any of this relate to my analysis in https://github.com/OCR-D/core/issues/723 yet?
I was not aware that such matters are discussed here instead of core...
@bertsky I was not aware of your analysis and performed mine independently. But after a quick glance at OCR-D/core#723, I see some potential overlaps. In my approach, as was already discussed in #723, I cached fileGrps and the files belonging to those groups in dictionaries. Could the memory leaks perhaps come from nested dictionaries not being deallocated? I have not observed leaks in my analysis, but I should look more closely to confirm.
> @MehmedGIT does any of this relate to my analysis in OCR-D/core#723 yet?
> I was not aware that such matters are discussed here instead of core...
They are not, the https://github.com/OCR-D/zenhub repo is for managing the development in the coordination project. It's public because we try to be transparent, but everything that is to be discussed in the wider community will be posted to core, spec or wherever appropriate.
@MehmedGIT can you push your optimized code to the https://github.com/OCR-D/core/tree/benchmark-mets branch pls? Then we can discuss this further in the context of a (proof-of-concept) PR?
@MehmedGIT
> Could the memory leaks perhaps come from nested dictionaries not being deallocated? I have not observed leaks in my analysis, but I should look more closely to confirm.
Like I said: caching tends to look like a leak if not controlled properly. It was just an assessment of our prior experiments/experiences.
@kba
> They are not, the https://github.com/OCR-D/zenhub repo is for managing the development in the coordination project. It's public because we try to be transparent, but everything that is to be discussed in the wider community will be posted to core, spec or wherever appropriate.
Understood, thanks.
BTW, I would suggest diversifying the test scenarios (e.g. `ocrd-cis-ocropy-dewarp` for line-level image files, or `ocrd-cis-ocropy-deskew` / `ocrd-tesserocr-deskew` for region-level files).

> @MehmedGIT can you push your optimized code to the https://github.com/OCR-D/core/tree/benchmark-mets branch pls? Then we can discuss this further in the context of a (proof-of-concept) PR?
I can, but it is really ugly code. Since I am not an expert in Python, I have explicitly imported some things and extended `OcrdMets` with my own class `ExtendedOcrdMets`, so that I could make the changes without having to modify and reinstall what is already available.
EDIT: Pushed
@MehmedGIT what is the open task here? Can we close this Epic?
@krvoigt the information here is relevant for the benchmarking of the OCR-D software. IMO, currently, there is no open task here and I'm not working on it. After the benchmarking meeting this Thursday and a discussion about the requirements, we may have further tasks. I guess we can close this Epic and create new ones once we have a clear vision of what we want to do.
@MehmedGIT Thank you! Maybe you can document the next steps resulting from this discussion in the "optimize METS handling" epic https://github.com/OCR-D/zenhub/issues/7
I have played with the code base of the OCR-D core and had an opportunity to implement and test some basic things. I have extended and created a child of the `OcrdMets` class (`ocrd_mets.py`) and implemented my own optimized versions of `find_files` and `add_file`, named `mm_find_files` and `mm_add_file`, respectively.

These three were optimized (red lines in the BEFORE and green lines in the AFTER screenshots):
- `startswith` method of `str` objects: the `find_files` method in `ocrd_mets.py` causes an unnecessarily high number of method calls (`ncalls`) by calculating the same thing over and over again inside a loop. Doing the calculation outside the loop and reusing the result greatly reduces the function calls, from 37M to 60K for the tested data (see the hoisting sketch after this list).
- `find_files` and `add_file` (both in `ocrd_mets.py`): basic caching was implemented and tested, currently caching only `mets:fileSec/mets:fileGrp` and `mets:file`. The caching and space complexity (RAM usage) can be optimized further once I know which parts of the `mets.xml` file are accessed more often than others. Since removing parts of the mets.xml is not covered by the test, the cached data should get more HITs, i.e., be more efficient the more data is retrieved (a caching sketch follows after the screenshots below).

These two are not optimized yet (blue lines in the screenshots):
- `set_physical_page_for_file`
- `remove_physical_page_fptr`
Optimizing them as well would greatly improve the overall efficiency!
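To illustrate the first point, here is a minimal sketch of the loop-invariant hoisting idea; this is not the actual `find_files` code, and the prefix handling is only an illustrative stand-in for the repeated computation the profiler flagged:

```python
def find_files_slow(files, file_grp):
    matches = []
    for f in files:
        # recomputed on every iteration although it never changes;
        # with millions of files this dominates the ncalls count
        prefix = file_grp.strip('/')
        if f.fileGrp.startswith(prefix):
            matches.append(f)
    return matches

def find_files_fast(files, file_grp):
    prefix = file_grp.strip('/')  # hoisted: computed exactly once
    return [f for f in files if f.fileGrp.startswith(prefix)]
```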
TEST DATA and BUILDERS:

Two builders (two variants of `build_mets`) were implemented. The first builder uses the non-optimized versions of `add_file` and `find_files`; the second builder uses the optimized versions (`mm_add_file` and `mm_find_files`). `mm_add_file` and `mm_find_files` match the signatures of `add_file` and `find_files`, which guarantees that the optimized versions can be used by the outside world without any changes, which is great! The output of both variants of `build_mets` is the same, i.e., the builders produce the same result (as expected).

BEFORE OPTIMIZATION: [cProfile screenshot]
AFTER OPTIMIZATION: [cProfile screenshot]
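A rough sketch of the caching idea from the list above, assuming `lxml` and illustrative dictionaries kept in sync with the XML tree (the class and method names here are mine, not the actual core API):

```python
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}

class CachedMets:
    """Illustrative only: caches mets:fileGrp and mets:file lookups."""

    def __init__(self, tree):
        self._tree = tree
        self._file_grps = {}   # USE attribute -> mets:fileGrp element
        self._files = {}       # ID attribute  -> mets:file element
        # pay the tree traversal once, instead of on every query
        for grp in tree.findall('.//mets:fileSec/mets:fileGrp', namespaces=NS):
            self._file_grps[grp.get('USE')] = grp
            for f in grp.findall('mets:file', namespaces=NS):
                self._files[f.get('ID')] = f

    def find_file(self, file_id):
        # O(1) dictionary hit instead of an XPath search per call
        return self._files.get(file_id)

    def add_file(self, file_grp, file_el):
        self._file_grps[file_grp].append(file_el)
        # keep the cache consistent with the underlying tree
        self._files[file_el.get('ID')] = file_el
```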
SUGGESTIONS:

- Use the `cProfile` library to test how methods perform. The screenshots above show the output of a cProfile run (a usage sketch follows below).
- Cache the result of `tree.getroot()`. Do the same for sub-trees that are accessed comparatively more frequently (also sketched below).
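For the first suggestion, a minimal cProfile sketch; `ocrd_models.OcrdMets` is the real class, while the `mets.xml` path and the `OCR-D-IMG` group are illustrative:

```python
import cProfile
import pstats

from ocrd_models import OcrdMets

mets = OcrdMets(filename='mets.xml')  # illustrative workspace file

with cProfile.Profile() as profiler:
    list(mets.find_files(fileGrp='OCR-D-IMG'))  # force full evaluation

# sort by cumulative time; ncalls per function shows the hot spots
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
```

And for the second, a hedged sketch of caching `tree.getroot()` and a frequently used sub-tree (again with an illustrative file path, not the core implementation):

```python
from lxml import etree

tree = etree.parse('mets.xml')   # illustrative path
root = tree.getroot()            # looked up once, then reused
# also cache a frequently accessed sub-tree
file_sec = root.find('{http://www.loc.gov/METS/}fileSec')

def iter_file_grps():
    # reuses the cached file_sec instead of searching from the root each time
    return file_sec.findall('{http://www.loc.gov/METS/}fileGrp')
```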