Closed. krvoigt closed this issue 2 years ago.
Create a discussion to discuss with the community
Here is additional info regarding the performance. The already available benchmark code helped me to write my own benchmark.

Benchmarking `OcrdMets.find_files()` with `benchmark_find_files(number_of_pages)` in the already available benchmarking code carries the additional overhead of building METS files with the generator `generated_mets(number_of_pages)` when searching for a file inside 5, 10, 20, 50, 100 pages. Hence, the run time of the `find_files()` function is obscured (building the METS file takes far more time than the search). What I would expect here is to generate the METS files beforehand, or to keep METS files of different sizes (small, medium, large) in the same directory or download them from a repo, and then measure the run time of `find_files()` separately. That's how I implemented my benchmark: I evaluated building METS files and searching for files separately, both with the optimized and the non-optimized versions of the functions.
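A minimal sketch of that separation, assuming `ocrd_models.OcrdMets` from core (the page counts follow the existing benchmark; `build_mets` below is a simplified stand-in for the real `generated_mets` generator, not the actual benchmark code):

```python
import time

from ocrd_models import OcrdMets

PAGE_COUNTS = [5, 10, 20, 50, 100]

def build_mets(number_of_pages):
    # simplified stand-in for the benchmark's METS generator
    mets = OcrdMets.empty_mets()
    for n in range(number_of_pages):
        mets.add_file('OCR-D-IMG', ID='FILE_%04d' % n,
                      mimetype='image/tiff', pageId='PHYS_%04d' % n)
    return mets

# build all METS files up front, so their construction cost
# does not leak into the find_files() timings below
prebuilt = {n: build_mets(n) for n in PAGE_COUNTS}

for n, mets in prebuilt.items():
    start = time.perf_counter()
    list(mets.find_files(fileGrp='OCR-D-IMG'))  # force full evaluation
    elapsed = time.perf_counter() - start
    print(f'{n} pages: find_files() took {elapsed:.6f}s')
```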
@MehmedGIT does any of this relate to my analysis in https://github.com/OCR-D/core/issues/723 yet?
I was not aware that such matters are discussed here instead of core...
@bertsky I was not aware of your analysis and performed mine independently. But after a quick glance at OCR-D/core#723, I see some potential overlaps. In my approach, as was already discussed in #723, I cached fileGrps and the files belonging to those groups in dictionaries. Could the memory leaks perhaps come from nested dictionaries not being deallocated? I have not observed leaks in my analysis, but I should look more closely to confirm.
> @MehmedGIT does any of this relate to my analysis in OCR-D/core#723 yet?
> I was not aware that such matters are discussed here instead of core...
They are not, the https://github.com/OCR-D/zenhub repo is for managing the development in the coordination project. It's public because we try to be transparent, but everything that is to be discussed in the wider community will be posted to core, spec or wherever appropriate.
@MehmedGIT can you push your optimized code to the https://github.com/OCR-D/core/tree/benchmark-mets branch pls? Then we can discuss this further in the context of a (proof-of-concept) PR?
@MehmedGIT
> Could the memory leaks perhaps come from nested dictionaries not being deallocated? I have not observed leaks in my analysis, but I should look more closely to confirm.
Like I said: caching tends to look like a leak if not controlled properly. It was just an assessment of our prior experiments/experiences.
@kba
> They are not, the https://github.com/OCR-D/zenhub repo is for managing the development in the coordination project. It's public because we try to be transparent, but everything that is to be discussed in the wider community will be posted to core, spec or wherever appropriate.
Understood, thanks.
BTW, I would suggest diversifying the test scenarios (e.g. `ocrd-cis-ocropy-dewarp` for line-level image files, or `ocrd-cis-ocropy-deskew` / `ocrd-tesserocr-deskew` for region-level files).

> @MehmedGIT can you push your optimized code to the https://github.com/OCR-D/core/tree/benchmark-mets branch pls? Then we can discuss this further in the context of a (proof-of-concept) PR?
I can, but it is really ugly code. Since I am not an expert in Python, I have explicitly imported some things and extended `OcrdMets` with my own class `ExtendedOcrdMets`, so that I could make the changes without having to modify and reinstall what is already available.
EDIT: Pushed
@MehmedGIT what is the open task here? Can we close this Epic?
@krvoigt the information here is relevant for the benchmarking of the OCR-D software. IMO, currently, there is no open task here and I'm not working on it. After the benchmarking meeting this Thursday and a discussion about the requirements, we may have further tasks. I guess we can close this Epic and create new ones once we have a clear vision of what we want to do.
@MehmedGIT Thank you! Maybe you can document the next steps resulting from this discussion in the "optimize METS handling" epic https://github.com/OCR-D/zenhub/issues/7
I have played with the code base of the OCR-D core and had an opportunity to implement and test some basic things. I have extended and created a child of the `OcrdMets` class (`ocrd_mets.py`) and implemented my own optimized versions of `find_files` and `add_file`, named `mm_find_files` and `mm_add_file`, respectively.

These three were optimized (red lines in the BEFORE and green lines in the AFTER screenshots):
- `startswith` method of `str` objects: the `find_files` method in `ocrd_mets.py` causes an unnecessarily high number of method calls (`ncalls`) by calculating the same thing over and over again inside a loop. Doing the calculation outside the loop and reusing the result greatly reduces the function calls, from 37M to 60K for the tested data (see the hoisting sketch after this list).
- `find_files` and `add_file` (both in `ocrd_mets.py`): basic caching was implemented and tested, currently caching only `mets:fileSec/mets:fileGrp` and `mets:file`. The caching and space complexity (RAM usage) can be optimized further once I know which parts of the `mets.xml` file are accessed more often than others. Since removing parts of the mets.xml is not covered by the test, the cached data should get more HITs, i.e., be more efficient the more data is retrieved (a caching sketch follows after the screenshots below).

These two are not optimized yet (blue lines in the screenshots):
- `set_physical_page_for_file`
- `remove_physical_page_fptr`
Optimizing them as well would greatly improve the overall efficiency!
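To illustrate the first point, here is a minimal sketch of the loop-invariant hoisting idea; this is not the actual `find_files` code, and the prefix handling is only an illustrative stand-in for the repeated computation the profiler flagged:

```python
def find_files_slow(files, file_grp):
    matches = []
    for f in files:
        # recomputed on every iteration although it never changes;
        # with millions of files this dominates the ncalls count
        prefix = file_grp.strip('/')
        if f.fileGrp.startswith(prefix):
            matches.append(f)
    return matches

def find_files_fast(files, file_grp):
    prefix = file_grp.strip('/')  # hoisted: computed exactly once
    return [f for f in files if f.fileGrp.startswith(prefix)]
```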
TEST DATA and BUILDERS:

Two builders (two variants of `build_mets`) were implemented. The first builder uses the non-optimized versions of `add_file` and `find_files`; the second builder uses the optimized versions (`mm_add_file` and `mm_find_files`). `mm_add_file` and `mm_find_files` match the signatures of `add_file` and `find_files`, which guarantees that the optimized versions can be used by the outside world without any changes, which is great! The output of both variants of `build_mets` is the same, i.e., the builders produce the same result (as expected).

BEFORE OPTIMIZATION: [cProfile screenshot]
AFTER OPTIMIZATION: [cProfile screenshot]
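A rough sketch of the caching idea from the list above, assuming `lxml` and illustrative dictionaries kept in sync with the XML tree (the class and method names here are mine, not the actual core API):

```python
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}

class CachedMets:
    """Illustrative only: caches mets:fileGrp and mets:file lookups."""

    def __init__(self, tree):
        self._tree = tree
        self._file_grps = {}   # USE attribute -> mets:fileGrp element
        self._files = {}       # ID attribute  -> mets:file element
        # pay the tree traversal once, instead of on every query
        for grp in tree.findall('.//mets:fileSec/mets:fileGrp', namespaces=NS):
            self._file_grps[grp.get('USE')] = grp
            for f in grp.findall('mets:file', namespaces=NS):
                self._files[f.get('ID')] = f

    def find_file(self, file_id):
        # O(1) dictionary hit instead of an XPath search per call
        return self._files.get(file_id)

    def add_file(self, file_grp, file_el):
        self._file_grps[file_grp].append(file_el)
        # keep the cache consistent with the underlying tree
        self._files[file_el.get('ID')] = file_el
```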
SUGGESTIONS:

- Use the `cProfile` library to test how methods perform. The screenshots above show the output of a cProfile run (a usage sketch follows below).
- Cache the result of `tree.getroot()`. Do the same for sub-trees that are accessed comparatively more frequently (also sketched below).
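For the first suggestion, a minimal cProfile sketch; `ocrd_models.OcrdMets` is the real class, while the `mets.xml` path and the `OCR-D-IMG` group are illustrative:

```python
import cProfile
import pstats

from ocrd_models import OcrdMets

mets = OcrdMets(filename='mets.xml')  # illustrative workspace file

with cProfile.Profile() as profiler:
    list(mets.find_files(fileGrp='OCR-D-IMG'))  # force full evaluation

# sort by cumulative time; ncalls per function shows the hot spots
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
```

And for the second, a hedged sketch of caching `tree.getroot()` and a frequently used sub-tree (again with an illustrative file path, not the core implementation):

```python
from lxml import etree

tree = etree.parse('mets.xml')   # illustrative path
root = tree.getroot()            # looked up once, then reused
# also cache a frequently accessed sub-tree
file_sec = root.find('{http://www.loc.gov/METS/}fileSec')

def iter_file_grps():
    # reuses the cached file_sec instead of searching from the root each time
    return file_sec.findall('{http://www.loc.gov/METS/}fileGrp')
```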