OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License
38 stars 11 forks source link

Use ocrd-tesserocr with caching for Processing Server not possible #190

Closed joschrew closed 1 year ago

joschrew commented 1 year ago

In the processing server pull request we want to use caching. Therefore an instance of a processor without a workspace is created. Later when the user requests a processor-run we want to set the workspace into that cache-instance.

The problem seems to be that ocrd-tesserocr processors delegates the processing internally to the processor TesserocrRecognize`: https://github.com/OCR-D/ocrd_tesserocr/blob/c10f94d2b36815314687a216495e8602b4a16bb0/ocrd_tesserocr/segment_region.py#L13-L32

The workspace is transferred to this class on creation. If the workspace is (re-)set afterwards it does not get delegated to the created instance of TesserocrRecognize

bertsky commented 1 year ago

Indeed, this pattern is a problem.

if hasattr(self, 'workspace'):

We could easily rewrite this by testing for output_file_grp instead of workspace AFAICS. (It's just there to differentiate between the other, non-processing CLI contexts, like dump and help.)

joschrew commented 1 year ago

The only solution I see is making workspace a property in the processor-base-class. And then override the setter in ocrd_tesserorc and transfer the newly set workspace to the wrapped instance of TesserocrSegmentRegion as well. But I don't know if that can be done in ocr-d-core without problems.

bertsky commented 1 year ago

The only solution I see is making workspace a property in the processor-base-class. And then override the setter in ocrd_tesserorc and transfer the newly set workspace to the wrapped instance of TesserocrSegmentRegion as well.

I don't understand how that would help. The kwarg transfer needs to happen in the constructor already.

Within the new paradigm of the changed processor API, differentiating between processing and non-processing context becomes painless. (The former can happen in setup, including all the parameter kwarg transfers.)

But for the moment, as I said, exchanging workspace for output_file_grp test should be safe. (And of course, passing *args indiscriminately, not self.workspace explicitly.)

joschrew commented 1 year ago

I don't understand how that would help. The kwarg transfer needs to happen in the constructor already.

That's the problem I am trying to describe. At the constructor we don't have the workspace yet because we are creating a cached instance. We want to set the workspace later after the constructor has finished. But when we do that (setting the workspace) the wrapped instance of TesserocrRecognize does not get updated with the "new" workspace.

With the code snipped (ocrd_tesserocr/ocrd_tesserocr/segment_region.py) I only wanted to show that the processor wraps an instance of TesserocrRecognize which does the real work. The if hasattr(self, 'workspace') is not a problem when the workspace is set to None with the first positional argument.

bertsky commented 1 year ago

But when we do that (setting the workspace) the wrapped instance of TesserocrRecognize does not get updated with the "new" workspace.

Ah, sry, I didn't see it! Yes, since we don't override the constructor, but construct and then wrap a separate processor object, many patterns don't work. So I would say this trick is bad practice and should be ruled out in the future.

IIRC I used that wrap pattern instead of overriding the constructor because I did not want the user to see the processor appear like ocrd-tesserocr-recognize (with all its complicated parameters). But the downsides are obviously too great.

I'll make a PR based on the simpler pattern.

The only solution I see is making workspace a property in the processor-base-class. And then override the setter in ocrd_tesserorc and transfer the newly set workspace to the wrapped instance of TesserocrSegmentRegion as well.

Yes, that would work, and maybe it's also the best pattern in general going forward for all processors. (The setter could do many other things, like syncing/closing log files, releasing locks, syncing the METS cache or METS server.)