
Suggestions regarding Mets handling optimization #39

Closed: krvoigt closed this issue 2 years ago

MehmedGIT commented 2 years ago

I have played with the code base of OCR-D core and had the opportunity to implement and test some basic things. I extended OcrdMets (ocrd_mets.py) with a child class and implemented my own optimized versions of find_files and add_file, named mm_find_files and mm_add_file respectively.
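A minimal sketch of what such a subclass could look like, assuming OcrdMets keeps its parsed document in self._tree (as in OcrdXmlDocument) and simplifying find_files() to the ID/fileGrp arguments; everything besides the OcrdMets import and the mm_* names is hypothetical and not the code that was actually pushed:

```python
# Hypothetical sketch of the caching idea, not the actual pushed code.
# Assumes OcrdMets keeps its parsed XML in self._tree (as in
# OcrdXmlDocument); raw lxml elements are returned instead of OcrdFile
# objects, and only the ID/fileGrp arguments are supported.
from ocrd_models.ocrd_mets import OcrdMets

NS_METS = '{http://www.loc.gov/METS/}'

class ExtendedOcrdMets(OcrdMets):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # fileGrp USE -> {mets:file ID -> element}, built once up front
        self._file_cache = {}
        for el_grp in self._tree.getroot().iter(NS_METS + 'fileGrp'):
            self._file_cache[el_grp.get('USE')] = {
                el.get('ID'): el for el in el_grp.iter(NS_METS + 'file')}

    def mm_find_files(self, ID=None, fileGrp=None):
        # dictionary lookups instead of re-scanning the whole XML tree
        groups = [fileGrp] if fileGrp else list(self._file_cache)
        hits = []
        for grp in groups:
            files = self._file_cache.get(grp, {})
            if ID is None:
                hits.extend(files.values())
            elif ID in files:
                hits.append(files[ID])
        return hits

    def mm_add_file(self, fileGrp, ID, el_file):
        # the tree insertion itself is omitted here; the point is that
        # every write must also update the cache so lookups stay correct
        self._file_cache.setdefault(fileGrp, {})[ID] = el_file
```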

Three methods were optimized (red lines in the BEFORE screenshot, green lines in the AFTER screenshot).

Two methods are not optimized yet (blue lines in the screenshots). Optimizing them as well would further improve the overall efficiency.

TEST DATA and BUILDERS:

BEFORE OPTIMIZATION: 1_original
AFTER OPTIMIZATION: 2_improved

SUGGESTIONS:

krvoigt commented 2 years ago

Create a discussion to raise this with the community.

MehmedGIT commented 2 years ago

Here is some additional information regarding performance. The existing benchmark code helped me write my own benchmark.

Benchmarking OcrdMets.find_files() with benchmark_find_files(number_of_pages) in the existing benchmarking code carries the extra overhead of building the METS files via the generator generated_mets(number_of_pages) while searching for a file across 5, 10, 20, 50, and 100 pages. The run time of find_files() is therefore obscured, since building the METS file takes far more time than the search itself. What I would expect here is to generate the METS files beforehand, or to keep METS files of different sizes (small, medium, large) in the same directory or download them from a repo, and then measure the run time of find_files() separately. That is how I implemented my benchmark: I evaluated building METS and searching for files separately, both with the optimized and the non-optimized versions of the functions.
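A rough sketch of that separation, assuming generated_mets(number_of_pages) from the existing benchmarking code returns a ready METS instance and using a purely illustrative fileGrp query:

```python
# Rough sketch of the proposed benchmark structure. generated_mets() is
# assumed to be the generator from the existing benchmarking code, and
# the find_files() query arguments are only illustrative.
from time import perf_counter

SIZES = (5, 10, 20, 50, 100)
REPEATS = 1000

# Build phase: create all METS documents up front, outside the timed region.
documents = {n: generated_mets(n) for n in SIZES}

# Search phase: time only find_files(), averaged over many calls.
for n, mets in documents.items():
    start = perf_counter()
    for _ in range(REPEATS):
        # list() in case find_files() returns a generator
        list(mets.find_files(fileGrp='OCR-D-IMG'))
    elapsed = (perf_counter() - start) / REPEATS
    print(f'{n:3d} pages: {elapsed * 1e6:8.1f} µs per find_files() call')
```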

Attachment: Benchmark.png

bertsky commented 2 years ago

@MehmedGIT does any of this relate to my analysis in https://github.com/OCR-D/core/issues/723 yet?

I was not aware that such matters are discussed here instead of core...

MehmedGIT commented 2 years ago

@bertsky I was not aware of your analysis and performed mine independently. After a quick glance at OCR-D/core#723, though, I see some potential overlap. In my approach, as already discussed in #723, I cached fileGrps and the files belonging to those groups in dictionaries. Could the memory leaks come from the nested dictionaries not being deallocated? I have not observed leaks in my analysis, but I should look more closely to confirm that.
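To make the question concrete, here is a small self-contained illustration (with arbitrary sizes, unrelated to the actual code) of how a nested cache dict holds on to memory until it is explicitly cleared:

```python
# Self-contained illustration with arbitrary sizes: a nested cache dict
# keeps every entry alive until it is cleared, which a memory profile can
# mistake for a leak even though the references are reclaimable.
import tracemalloc

tracemalloc.start()
cache = {f'GRP_{g}': {f'FILE_{i}': object() for i in range(1000)}
         for g in range(100)}
current, _peak = tracemalloc.get_traced_memory()
print(f'with cache:  {current // 1024} KiB')

cache.clear()  # dropping the references lets the GC reclaim the nested dicts
current, _peak = tracemalloc.get_traced_memory()
print(f'after clear: {current // 1024} KiB')
```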

kba commented 2 years ago

@MehmedGIT does any of this relate to my analysis in OCR-D/core#723 yet?

I was not aware that such matters are discussed here instead of core...

They are not; the https://github.com/OCR-D/zenhub repo is for managing the development in the coordination project. It's public because we try to be transparent, but everything that is to be discussed with the wider community will be posted to core, spec, or wherever is appropriate.

kba commented 2 years ago

@MehmedGIT can you push your optimized code to the https://github.com/OCR-D/core/tree/benchmark-mets branch, please? Then we can discuss this further in the context of a (proof-of-concept) PR.

bertsky commented 2 years ago

@MehmedGIT

Could the memory leaks come from the nested dictionaries not being deallocated? I have not observed leaks in my analysis, but I should look more closely to confirm that.

As I said: caching tends to look like a leak if it is not controlled properly. That was just an assessment of our prior experiments and experiences.

bertsky commented 2 years ago

@kba

They are not; the https://github.com/OCR-D/zenhub repo is for managing the development in the coordination project. It's public because we try to be transparent, but everything that is to be discussed with the wider community will be posted to core, spec, or wherever is appropriate.

Understood, thanks.

bertsky commented 2 years ago

By the way, I would suggest diversifying the test scenarios:

MehmedGIT commented 2 years ago

@MehmedGIT can you push your optimized code to the https://github.com/OCR-D/core/tree/benchmark-mets branch, please? Then we can discuss this further in the context of a (proof-of-concept) PR.

I can, but it is really ugly code. Since I am not an expert in Python, I explicitly imported some things and extended OcrdMets with my own class ExtendedOcrdMets, so I could make the changes without rebuilding what is already available.

EDIT: Pushed

krvoigt commented 2 years ago

@MehmedGIT what is the open task here? Can we close this Epic?

MehmedGIT commented 2 years ago

@krvoigt the information here is relevant for the benchmarking of the OCR-D software. IMO there is currently no open task here, and I'm not working on it. After the benchmarking meeting this Thursday and a discussion of the requirements, we may have further tasks. I guess we can close this Epic and create new ones once we have a clear vision of what we want to do.

krvoigt commented 2 years ago

@MehmedGIT Thank you! Maybe you can document the next steps resulting from this discussion in the "optimize METS handling" epic: https://github.com/OCR-D/zenhub/issues/7