OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

New processor API #1240

Open bertsky opened 3 months ago

bertsky commented 3 months ago

This is a first attempt at implementing #322 – without the actual error handling or parallel processing (so more or less refactoring and deprecation). It also enables fixing #579 in the process.

bertsky commented 3 months ago

A couple of points worth discussing:

bertsky commented 3 months ago

BTW, I consider it bad practice that we always load the ocrd-tool.json in the inheriting constructor.

It should really be a class attribute. And assuming it is always placed in the same spot of the Python distribution (which is currently the case but not strictly required as our specs only require it in the repository root), one can even load it in the superclass (so nothing would have to be done in the inheriting code).

bertsky commented 3 months ago

BTW, I consider it bad practice that we always load the ocrd-tool.json in the inheriting constructor.

It should really be a class attribute. And assuming it is always placed in the same spot of the Python distribution (which is currently the case but not strictly required as our specs only require it in the repository root), one can even load it in the superclass (so nothing would have to be done in the inheriting code).

I now solved this by auto-loading the ocrd-tool.json in the superclass and storing it under the property metadata (as well as the derived version and ocrd_tool). The executable name gets auto-inferred from the entry point at runtime (bottom stack frame). So in the standard case there is no need for any initialisation boilerplate anymore. If necessary, one can easily override these properties in the subclass (as is in fact necessary for the built-in dummy processor and in the special cases around bashlib). It is still possible to set ocrd_tool and version via kwarg in the constructor, but this has been deprecated. See the last 3 commits.
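For illustration, a minimal sketch of what an implementing processor could then look like (assuming the property names described above; the class name and process body are hypothetical):

from ocrd import Processor

class MyProcessor(Processor):  # hypothetical implementing processor
    # no __init__ boilerplate needed: ocrd-tool.json is auto-loaded into
    # self.metadata, self.ocrd_tool and self.version are derived from it,
    # and the executable name is inferred from the entry point at runtime
    def process(self):
        ...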

To showcase this, I started migrating ocrd_tesserocr to the new core API under https://github.com/OCR-D/ocrd_tesserocr/pull/216 – it's still using the old process functions ATM, I'll migrate these in the next step. (Tests do all run through.)

bertsky commented 3 months ago

Notice how Processor.setup currently only gets called if one is using run_processor (via get_processor actually). So it's not enough anymore to just instantiate a processor and call process or process_workspace – one would also need to call setup in that case.
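In other words (sketch only; constructor arguments elided, class name hypothetical):

workspace = ...                    # an ocrd.Workspace, obtained e.g. via Resolver
proc = MyProcessor(...)            # hypothetical subclass, constructor arguments elided
proc.setup()                       # no longer implicit – only run_processor/get_processor call it for you
proc.process_workspace(workspace)  # or proc.process()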

bertsky commented 3 months ago

To showcase this, I started migrating ocrd_tesserocr to the new core API under OCR-D/ocrd_tesserocr#216 – it's still using the old process functions ATM, I'll migrate these in the next step. (Tests do all run through.)

Now also uses process_page_pcgts instead of process. In doing that I noticed there is one more problematic case: in ocrd-tesserocr-recognize we need the mapping returned from page_from_file(... with_etree=True). But how do we make that configurable as a standard API? The list of input_pcgts already has multi-valued semantics (one per input fileGrp). And we usually don't need the etree. In this case I solved it by exporting it from the model in the inheriting class ad hoc, which is inefficient...

bertsky commented 1 month ago
  1. Inside get_processor why is setup() not called in case the instance_caching flag is set? When the flag is set and there is no cached processor yet, the setup() is never called.

Oh! You are right – thanks for observing this! I forgot to add that call to get_cached_processor as well.

2. Could you decrypt that part for me from the ocrd_tesserocr PR 216?

pcgts_mapping = dict()
_ = pcgts.to_etree(mapping_=pcgts_mapping)
page = pcgts.get_Page()

Does the to_etree method have side effects for the pcgts class?

No, it does not. But the function will return the etree, the mapping and the reverse mapping in addition to the PcGts instance.

When we introduced the etree option, always changing the return type to tuple would have been a behavioural (i.e. API) change, hence we stayed backwards compatible and made the return type dependent on the kwarg.
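A sketch of that kwarg-dependent return type, using page_from_file as mentioned above (input_file stands for some OcrdFile referencing a PAGE-XML document):

from ocrd_modelfactory import page_from_file

input_file = ...  # some OcrdFile referencing a PAGE-XML document

pcgts = page_from_file(input_file)                                            # backwards-compatible: just the model
pcgts, etree, mapping, revmap = page_from_file(input_file, with_etree=True)  # full tuple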

Why is that not happening just inside get_Page() with an extra parameter pcgts_mapping?

Because you cannot reach the original etree from the DOM object (unless you generate a new one). For generateDS itself, returning the etree and mappings along with the constructed model comes at zero cost. We used to need only the model, but later added code in ocrd_tesserocr that needs (runtime-configurable) XPath queries, so accessing the etree became a requirement as well.

I lack understanding of what is going on inside the auto-generated ocrd_models.ocrd_page_generateds.py

generateDS constructs the model from the schema according to its ruleset. But in doing so, it of course uses lxml and operates on the etree back and forth. So the parse function offers returning these other data as well.

Why is that inefficient in comparison to ocrd_modelfactory.page_from_image:

mapping = dict()
etree = pcgts.to_etree(mapping_=mapping)
revmap = dict(((node, element) for element, node in mapping.items()))
return pcgts, etree, mapping, revmap

Because this creates a new etree (after we have thrown away the etree that was available during parsing).

It suffices for ocrd_tesserocr, and I'm not sure if we will really need etrees in other processors as well. It depends on the paradigm we want to have for dynamic workflows. Passing XPath expressions as runtime parameters is dynamic, but complicates the processors themselves. In contrast, having a more flexible workflow syntax which will select different parameters or processors in follow-up steps depending on certain data queries is more powerful (but also more difficult to implement and control).

I should make a side experiment to determine how expensive this actually is. If significant, we should have a new API which always yields the etree and mappings.
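Such a side experiment could be as simple as the following micro-benchmark (a sketch, not measured here; input_file again stands for some OcrdFile referencing a PAGE-XML document):

import timeit
from ocrd_modelfactory import page_from_file

input_file = ...  # some OcrdFile referencing a PAGE-XML document

# compare parsing without vs. with keeping/returning the etree and mappings
print(timeit.timeit(lambda: page_from_file(input_file), number=100))
print(timeit.timeit(lambda: page_from_file(input_file, with_etree=True), number=100))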

bertsky commented 1 month ago

There is of course lots of dependent stuff we need to discuss (error handling, parallelization, dynamic workflows) but seeing as this is backwards-compatible, I see no reason not to proceed with a pre-release ASAP.

Agreed. I am working on a secondary PR that will implement error handling. I guess we'll have to see how soon that is ready (could be merged first, or become a later pre-release).

One thing I did notice while testing is that processors implement parts of this API already but not necessarily with the same semantics, e.g. many processors have a setup method but do not do the filegrp cardinality checks

Indeed. But that's not a problem IMO. These assertions are, after all, just additional safety belts. All processors use setup in the way we intend to use it here. But there is a conflict in having it called in the constructor already (as all 2.x-based processors do, mostly guarded by checking if hasattr(self, 'output_file_grp')): since we obviously don't want it to run twice, these processors must either be changed to remove that setup call before 3.0 can be fully released, or pin their dependency to ocrd<3.0.
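For reference, the 2.x pattern referred to looks roughly like this (hypothetical processor name):

from ocrd import Processor

class MyV2Processor(Processor):  # hypothetical 2.x-style processor
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if hasattr(self, 'output_file_grp'):
            # i.e. only in an actual processing context (not --help / --dump-json)
            self.setup()

It is this constructor-time setup call that would have to go before such a processor can depend on ocrd>=3.0.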

I just noticed I created another problem regarding setup: placing fileGrp assertions there conflicts with instance caching, which does not know about the output_file_grp in advance.

Perhaps we should make the cardinalities a property in ocrd-tool.json and enforce that in the superclass?

there or the ocrd_cis processors that have a self.ocrd_tool attribute that clashes with the one defined here.

Oh, indeed. Another reason to not release 3.0 directly.

And we should make sure that setup is called only once and esp. never in the processor's constructor. In fact, with the ocrd-tool/executable magic you implemented, constructors should need overriding only in exceptional cases.

Yes.

kba commented 1 month ago

Perhaps we should make the cardinalities a property in ocrd-tool.json and enforce that in the superclass?

Since the cardinality is defined as a constant, I see no reason not to put that in the ocrd-tool.json. We should get rid of/deprecate the input_file_grp and output_file_grp properties which serve no real purpose beyond confusing users in the --help output.

I just noticed I created another problem regarding setup: placing fileGrp assertions there conflicts with instance caching, which does not know about the output_file_grp in advance.

This is a problem though. Even with the cardinalities defined in ocrd-tool.json, at which point would you do the actual "split by commas, count tokens, compare to ocrd_tool['output_file_grp_cardinality']" check?

bertsky commented 1 month ago

Since the cardinality is defined as a constant, I see no reason not to put that in the ocrd-tool.json.

Not necessarily a constant (more of a range, usually a maximum or minimum int).

We should get rid of/deprecate the input_file_grp and output_file_grp properties which serve no real purpose beyond confusing users in the --help output.

Yes.

This is a problem though. Even with the cardinalities defined in ocrd-tool.json, at which point would you do the actual "split by commas, count tokens, compare to ocrd_tool['output_file_grp_cardinality']" check?

In Processor.process_workspace, for example. (Whether that function takes the fileGrps as proper arguments or reads them from the instance is still open. It should be in the superclass though.)
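A hedged sketch of such a check in the superclass (the field name ocrd_tool['output_file_grp_cardinality'] is taken from the quote above and not yet part of the spec):

# a method on the Processor superclass (sketch):
def _assert_output_file_grp_cardinality(self):
    expected = self.ocrd_tool.get('output_file_grp_cardinality', -1)
    actual = len(self.output_file_grp.split(','))  # "split by commas, count tokens"
    if expected >= 0 and actual != expected:
        raise ValueError(f"expected {expected} output fileGrp(s), but got {actual}")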

bertsky commented 1 month ago

Partial CI failure seems to be the Docker Compose race again:

 Container ocrd_network_mongo_db  Error
 Container ocrd_network_rabbit_mq  Healthy
dependency failed to start: container ocrd_network_mongo_db exited (48)
make: *** [Makefile:293: network-integration-test-cicd] Error 1

Maybe we just need to pull from master.

MehmedGIT commented 1 month ago

Partial CI failure seems to be the Docker Compose race again:

 Container ocrd_network_mongo_db  Error
 Container ocrd_network_rabbit_mq  Healthy
dependency failed to start: container ocrd_network_mongo_db exited (48)
make: *** [Makefile:293: network-integration-test-cicd] Error 1

Maybe we just need to pull from master.

Just restart that single failing test. It is not a race condition; rather, the Mongo DB service did not start at the expected time.

bertsky commented 1 month ago

We now have …

… more or less succeeding against 3.0.0a1, which tells us that most likely we won't need 3.0.0a2 etc.

(But porting at least ocrd_segment and cor-asv-ann, to cover some more demanding use-cases, is probably due now and should be expected to produce more changes here.)

So let's refocus on the next features for b1 and rc1, candidates of which would be AFAICS:

bertsky commented 1 month ago

@kba please release as a2 or b1 (no opinion here)

bertsky commented 1 month ago

Unfortunately, in the current state, I reintroduced the problem that if a processor has mandatory parameters without defaults, you cannot instantiate it without those. So not only do --help and --dump-json not work for them, but any normal processing call with parameters also fails, because we first instantiate a disposable instance without parameters. So the separation of constructor / decorator and processing / non-processing calls is still not correct.

kba commented 1 month ago

Unfortunately, in the current state, I reintroduced the problem that if a processor has mandatory parameters without defaults, you cannot instantiate it without those. So not only do --help and --dump-json not work for them, but any normal processing call with parameters also fails, because we first instantiate a disposable instance without parameters. So the separation of constructor / decorator and processing / non-processing calls is still not correct.

OK, can confirm. I'll try to find a fix. But first, a2: https://github.com/OCR-D/core/releases/tag/v3.0.0a2 / https://pypi.org/project/ocrd/3.0.0a2/

bertsky commented 1 month ago

Unfortunately, in the current state, I reintroduced the problem that if a processor has mandatory parameters without defaults, you cannot instantiate it without those. So not only do --help and --dump-json not work for them, but any normal processing call with parameters also fails, because we first instantiate a disposable instance without parameters. So the separation of constructor / decorator and processing / non-processing calls is still not correct.

cf. https://github.com/bertsky/core/pull/14

How about also introducing a Processor.shutdown – to be overridden by implementors if necessary (just like setup), to be called by core before

* exiting
* re-running `setup` after `parameter` changed
* removing an instance from the processor cache

kba commented 1 month ago

How about also introducing a Processor.shutdown – to be overridden by implementors if necessary (just like setup), to be called by core before

* exiting 
* re-running `setup` after `parameter` changed
* removing an instance from the processor cache

While I think that most processors won't need it, there might be non-Python data structures that need to be explicitly freed or shut down. Since this will be optional, with a no-op default implementation, I see no reason not to give processor developers this option.
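A minimal sketch of what that optional hook could look like (the tesserocr override and its tessapi attribute are only a hypothetical example of such a non-Python resource):

# sketch of the default in the superclass:
class Processor:
    def shutdown(self):
        """Release any resources acquired in setup(). Default: no-op."""
        pass

# hypothetical override in an implementing processor:
class TesserocrRecognize(Processor):
    def shutdown(self):
        self.tessapi.End()  # explicitly free the non-Python Tesseract handle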

MehmedGIT commented 1 month ago

Please also consider my suggestion here.

What do you think about the following suggestion: instead of keeping a single cache in the core for all processors, we could keep a list of caches. Each unique processor would have its own cache. Then for each cache, the core would invoke a method to decide the desired maximum number of instances. I am aware that this still does not fully solve all the issues, but at least we get more flexibility, and implementers could override the default method to return the best number for each specific processor.
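A rough sketch of that suggestion (all names illustrative; max_instances stands for the overridable method mentioned above):

from collections import OrderedDict

_caches = {}  # one LRU-style cache per unique processor executable

def get_cached_processor(executable, key, factory, max_instances):
    cache = _caches.setdefault(executable, OrderedDict())
    if key in cache:
        cache.move_to_end(key)
        return cache[key]
    instance = factory()
    instance.setup()  # cf. point 1 above – must not be skipped on the cached path
    cache[key] = instance
    if len(cache) > max_instances(executable):
        cache.popitem(last=False)  # evict the least recently used instance
    return instance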

bertsky commented 1 month ago

@kba feel free to make next pre-release!

(Obviously, we'll have to pull from master and resolve changelog+version before we finally merge 3.0.)

bertsky commented 1 month ago

Note: we still have a problem with the docstrings in the superclass. They are useful for the API doc of core, but they are also used by generate_processor_help in concrete processors – i.e. they are somehow inherited and shown in the self-documentation of implementations.

bertsky commented 1 month ago

Note: we still have a problem with the docstrings in the superclass. They are useful for the API doc of core, but they are also used by generate_processor_help in concrete processors – i.e. they are somehow inherited and shown in the self-documentation of implementations.

Python documentation for all versions says this:

function.__doc__: The function’s documentation string, or None if unavailable. Not inherited by subclasses.

However, the inspect.getdoc() behaviour on top of __doc__ changed with 3.5:

The inspect.getdoc() function now returns documentation strings inherited from base classes. Documentation strings no longer need to be duplicated if the inherited documentation is appropriate. To suppress an inherited string, an empty string must be specified (or the documentation may be filled in). This change affects the output of the pydoc module and the help() function. (Contributed by Serhiy Storchaka in bpo-15582.)

However, in real life I get this:

class A:
    """class A"""
    def test(self):
        """test func"""
        pass

class B(A):
    pass

class C(B):
    def test(self):
        pass

>>> B.test.__doc__
'test func'
>>> B().test.__doc__
'test func'
>>> C.test.__doc__
>>> C().test.__doc__
>>> 

Puzzling!

bertsky commented 1 month ago

I found https://bugs.python.org/issue40257 and https://bugs.python.org/issue15582 touching on the docstring inheritance issue, but it's still unclear to me why Python behaves the way it does. I can see the point of repeating in getdoc(), but it seems stupid for __doc__.

Anyway, I managed to solve this simply by comparing with the Processor superclass docstrings. (For that, I had to refactor slightly, moving generate_processor_help from the helpers to the base module to avoid a circular import.)
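The idea boils down to something like this (a simplified sketch with illustrative names, not the actual generate_processor_help code):

import inspect
from ocrd import Processor

def own_docstring(processor_class, method_name):
    doc = inspect.getdoc(getattr(processor_class, method_name))
    base_doc = inspect.getdoc(getattr(Processor, method_name))
    # suppress docstrings that are merely inherited from the superclass
    return doc if doc != base_doc else None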

The docstring help texts in our ported processors now all look fine.

I also added another config value (OCRD_MAX_MISSING_OUTPUTS) which will allow raising an exception if too many pages had to be skipped or fallback-copied due to exceptions.
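A hedged sketch of that threshold check (exact semantics and placement in core may differ):

from ocrd_utils import config

def check_missing_outputs(num_missing, num_total):
    # raise if the share of skipped/fallback-copied pages exceeds the configured maximum
    if num_total and config.OCRD_MAX_MISSING_OUTPUTS > 0 \
       and num_missing / num_total > config.OCRD_MAX_MISSING_OUTPUTS:
        raise Exception(f"too many pages skipped or fallback-copied: {num_missing} of {num_total}")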

bertsky commented 1 month ago

BTW, the default of 10% for the new OCRD_MAX_MISSING_OUTPUTS may not be what we want. We could easily set it to -1 or 0 to change that to an opt-in.

bertsky commented 1 month ago

@kba I would say it's time for b2!

What I still don't like conceptually is that we set Processor.workspace during Processor.process_workspace(workspace), but do not unset it (delattr) afterwards. Also, we set Processor.output_file_grp, Processor.input_file_grp and Processor.page_id outside of Processor.process_workspace. We are in a unique position now to make this cleaner: define workspace, fileGrps, pageId etc. as Optional properties, pass all of them as arguments to process_workspace, assign them to the instance for the duration of that function, and finally set them to None again. What do you think?
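For illustration, a sketch of that cleaner lifecycle as a context manager (names only; not a proposal for the actual signature):

from contextlib import contextmanager

@contextmanager
def bound(processor, workspace, input_file_grp, output_file_grp, page_id=None):
    processor.workspace = workspace
    processor.input_file_grp = input_file_grp
    processor.output_file_grp = output_file_grp
    processor.page_id = page_id
    try:
        yield processor
    finally:
        # reset instead of leaving stale state on the instance
        processor.workspace = None
        processor.input_file_grp = None
        processor.output_file_grp = None
        processor.page_id = None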

Before we can do an RC, we still have to run some actual workflows on actual data...

bertsky commented 1 month ago

BTW, the default of 10% for the new OCRD_MAX_MISSING_OUTPUTS may not be what we want. We could easily set it to -1 or 0 to change that to an opt-in.

Same goes for the defaults OCRD_MISSING_INPUT=SKIP / OCRD_MISSING_OUTPUT=SKIP / OCRD_EXISTING_OUTPUT=SKIP – this is a significant behavioural change – effectively, we used to have ABORT. If we keep the new defaults, we should at least put a big disclaimer about that in the documentation. And all regression tests / CI will have to actively set these to ABORT.

bertsky commented 3 weeks ago

Another backwards compatibility problem is behaviour for --overwrite: our v2 processors use their own Workspace.add_file calls, without setting force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE'. They expect the workspace to implicitly handle that via Workspace.overwrite_mode, which I removed in favour of the config mechanism.

Any ideas how we can best get a workaround for that (at least for some grace period)?

bertsky commented 3 weeks ago

Another backwards compatibility problem is behaviour for --overwrite: our v2 processors use their own Workspace.add_file calls, without setting force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE'. They expect the workspace to implicitly handle that via Workspace.overwrite_mode, which I removed in favour of the config mechanism.

Perhaps, in v3, instead of passing force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE' wherever we control the call, we should rather check if force or ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE' within add_file generally?

EDIT: I did exactly that – in hindsight, it also seems natural.
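A rough sketch of what that looks like inside Workspace.add_file (signature heavily simplified; the real method takes more arguments):

from ocrd_utils import config

# inside ocrd.workspace.Workspace (sketch):
def add_file(self, file_grp, force=False, **kwargs):
    if not force and config.OCRD_EXISTING_OUTPUT == 'OVERWRITE':
        force = True
    # ... then replace an existing file instead of raising, if force is set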