krvoigt commented 2 years ago

Current situation

Processors iterate over the files in a workspace on their own. While it is possible to restrict processing to a single page or a list/range of pages, the API is targeted towards processors deriving the pages to process on their own. Setup functionality (like loading models or other data into memory) is intertwined with processing, making it difficult to separate the two (i.e. when doing pagewise processing with a pageID restriction, the setup in process still happens for every call).

How it should be

The process method should be deprecated and replaced with a process_page method.

Processors should have a setup method that encapsulates all the post-initialization but pre-processing steps necessary for processing.
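
A minimal sketch of what this split could look like, assuming a hypothetical base class. PagewiseProcessor, run and UppercaseProcessor are illustrative names, not the actual OCR-D/core API; only setup, process_page and the deprecated process come from the proposal above.

```python
import warnings


class PagewiseProcessor:
    """Hypothetical base class; not the actual OCR-D/core API."""

    def __init__(self):
        self._is_set_up = False

    def setup(self):
        """One-time, post-initialization work (e.g. loading models)."""

    def process_page(self, page_id):
        """Process a single page; subclasses override this."""
        raise NotImplementedError

    def run(self, page_ids):
        """Run setup once (if not done yet), then process each page."""
        if not self._is_set_up:
            self.setup()
            self._is_set_up = True
        return [self.process_page(page_id) for page_id in page_ids]

    def process(self, page_ids):
        """Deprecated entry point, kept for backwards compatibility."""
        warnings.warn(
            "process() is deprecated; implement process_page() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.run(page_ids)


class UppercaseProcessor(PagewiseProcessor):
    """Toy processor whose 'model' is just a string method."""

    def setup(self):
        self.model = str.upper  # stands in for an expensive model load

    def process_page(self, page_id):
        return self.model(page_id)
```

With this shape, a caller can hand over any list of page IDs, and the old process entry point keeps working while emitting a DeprecationWarning.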

Steps

[ ] Refactor processor code in OCR-D/core to provide entry points for process_page and setup
[ ] Deprecate process
[ ] Test
[ ] Change all the processors
[ ] Communicate change in Tech Call
[ ] Reflect changed API in documentation

paulpestov commented 2 years ago

Maybe we could describe more what problem we are trying to solve and what users can expect after the implementation.

"Setup functionality (like loading models or other data into memory) is intertwined with processing, making it difficult to separate the two." E.g., why is it useful to make this separation?

PS: I think the purpose behind this feature would normally serve as the epic description (like "reduce processing time by X to meet metric Y"), and one of the actual user stories from that epic would be "as a processor dev I want to process pages in parallel".

kba commented 2 years ago

Setup functionality (like loading models or other data into memory) is intertwined with processing, making it difficult to separate the two. E.g. why is it useful to make this separation?

It improves performance because setting up the processor can be done just once instead of with every call to process.
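
To illustrate the point, here is a small self-contained sketch that counts setup calls under both styles. CountingProcessor and the page IDs are made up for the example, and the "model" is a stand-in for an expensive load; this is not OCR-D/core code.

```python
class CountingProcessor:
    """Hypothetical processor that counts how often its setup runs."""

    def __init__(self):
        self.setup_calls = 0
        self.model = None

    def setup(self):
        self.setup_calls += 1
        self.model = len  # stands in for an expensive model load

    def process_page(self, page_id):
        return self.model(page_id)

    def process(self, page_ids):
        """Legacy style: setup is intertwined with processing."""
        self.setup()
        return [self.process_page(p) for p in page_ids]


pages = ["phys_0001", "phys_0002", "phys_0003"]

# Legacy style: one call to process per page repeats the setup each time.
legacy = CountingProcessor()
for page in pages:
    legacy.process([page])

# Proposed style: setup once, then call process_page for every page.
pagewise = CountingProcessor()
pagewise.setup()
for page in pages:
    pagewise.process_page(page)

print(legacy.setup_calls, pagewise.setup_calls)  # prints "3 1"
```

The saving grows with the number of pages and with the cost of whatever setup loads.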

OCR-D / zenhub

Pagewise Processing #2