OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

Expose image region extraction code as CLI #264

Open kba opened 5 years ago

kba commented 5 years ago

For implementations using the bashlib API, it would be useful to have a command to extract a specific block/line from a PAGE, e.g.

ocrd workspace extract --element-id="line123" --page-id="page321" --output=line123.tiff

cf. https://github.com/OCR-D/ocrd_olena/pull/2#discussion_r304248793

bertsky commented 5 years ago

But we first need the extraction Python API in core. Currently, it is only in common of OCR-D/ocrd_tesserocr, right? (related: OCR-D/ocrd_tesserocr#56)

bertsky commented 5 years ago

Also, bashlib needs to offer more than just extraction: we also have to create PAGE-XML output with Olena binarization (referencing the new file in AlternativeImage and in the METS). So we have to shell-wrap the generateDS data model.

Is that possible with Click by any chance? Or would it have to be a general OOP extension? Maybe something like Skull, which already can do general XML deserialization like so – but the current implementation is a mess (it's just sed expressions!)

kba commented 5 years ago

Is that possible with Click by any chance? Or would it have to be a general OOP extension?

I'd much rather make the Python CLI more flexible than use a shell framework. Not that I don't love those, but the cost/reward is off at this point.

So we have to shell-wrap the generateDS data model.

That is also very ambitious. Generating a generic CLI frontend for the generateDS code is probably doable, but a lot of effort. It is probably substantially easier to offer a limited set of options, like those of the workspace subcommand.

bertsky commented 5 years ago

I fully agree.

FYI, for Olena I am in the middle of a PR that will allow querying from and appending to PAGE's AlternativeImage with xmlstarlet; tentatively only on the Page level, but the other hierarchy levels are doable. The main obstacle was/is the namespace problem:

  1. capturing more than one namespace version
  2. dealing with namespace default vs namespace prefix (both are possible)
  3. migrating to newest version

If this is successful, it could easily be generalized to core.
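For comparison, here is a minimal Python sketch of the first two namespace points, using only the standard library rather than xmlstarlet. The helper names (`page_namespace`, `alternative_images`, `add_alternative_image`) are hypothetical, and the sketch assumes that AlternativeImage elements are the leading children of Page. It does not attempt the third point (migrating to the newest version).

```python
import re
import xml.etree.ElementTree as ET

# Matches the root tag for any dated PAGE namespace version.
PAGE_ROOT = re.compile(
    r"^\{(http://schema\.primaresearch\.org/PAGE/gts/pagecontent/"
    r"\d{4}-\d{2}-\d{2})\}PcGts$"
)


def page_namespace(root):
    """Return the PAGE namespace URI actually used by this document."""
    match = PAGE_ROOT.match(root.tag)
    if match is None:
        raise ValueError("not a PAGE document: %r" % root.tag)
    return match.group(1)


def alternative_images(root):
    """List the filenames of Page-level AlternativeImage elements."""
    ns = page_namespace(root)
    page = root.find("{%s}Page" % ns)
    return [alt.get("filename")
            for alt in page.findall("{%s}AlternativeImage" % ns)]


def add_alternative_image(root, filename, comments):
    """Append a Page-level AlternativeImage after the existing ones."""
    ns = page_namespace(root)
    page = root.find("{%s}Page" % ns)
    existing = page.findall("{%s}AlternativeImage" % ns)
    alt = ET.Element("{%s}AlternativeImage" % ns,
                     {"filename": filename, "comments": comments})
    # Assumption: AlternativeImage elements come first among Page's children.
    page.insert(len(existing), alt)
    return alt
```

Because ElementTree expands every tag to `{uri}local`, a document using a default namespace and one using a prefix look the same to these helpers; only the version part of the URI has to be detected.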

bertsky commented 5 years ago

@kba, see bashlib-related FIXMEs in OCR-D/ocrd_olena#5

EEngl52 commented 4 years ago

@kba @bertsky can this be closed?

bertsky commented 4 years ago

can this be closed?

I don't think so. We should at least re-visit.

What we have now as a proof of concept in ocrd-olena-binarize and ocrd-im6convert is based on the xmlstarlet CLI – essentially a (partial) re-implementation. This is hard to modify/use/transfer, even if it were integrated into bashlib. (I guess any shell code is hard to make into a good API without non-trivial data structures.)

If we had a generic image processor in Python, we could probably reduce the need for that shell API greatly.

bertsky commented 3 years ago

If we had a generic image processor in Python, we could probably reduce the need for that shell API greatly.

We do have that (as part of ocrd_wrap's ocrd-preprocess-image), but I do think that it could be worthwhile in itself to provide the exact interface @kba proposed in the opening comment (as a shell layer around Workspace.image_from_page and Workspace.image_from_segment). At the moment bashlib processors only partially address this part (by cropping coarsely with ImageMagick). So it's actually complementary to the PAGE-XML / xmlstarlet aspect we discussed in between.
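For illustration, such a shell layer might look like the following minimal sketch. The `build_parser` and `extract` functions are hypothetical and not part of core. The calls to `image_from_page` and `image_from_segment` assume the keyword arguments `feature_selector` and `feature_filter`, and the return shapes (image, coords, image info) and (image, coords). The workspace, page and segment objects are passed in, so looking them up by ID is left out.

```python
import argparse


def build_parser():
    """Argument parser mirroring the command proposed in the opening comment."""
    parser = argparse.ArgumentParser(prog="ocrd workspace extract")  # hypothetical
    parser.add_argument("--page-id", required=True)
    parser.add_argument("--element-id", default=None)
    parser.add_argument("--output", required=True)
    parser.add_argument("--feature-selector", default="")
    parser.add_argument("--feature-filter", default="")
    return parser


def extract(workspace, page, segment, args):
    """Write the page image (or one segment of it) to args.output.

    Returns the coords of the written image, so a caller could print
    or store them for later coordinate transformations.
    """
    page_image, page_coords, _ = workspace.image_from_page(
        page, args.page_id,
        feature_selector=args.feature_selector,
        feature_filter=args.feature_filter)
    if segment is None:
        image, coords = page_image, page_coords
    else:
        # Assumption: the segment is extracted directly from the page image;
        # a full implementation would cascade through the parent segments.
        image, coords = workspace.image_from_segment(
            segment, page_image, page_coords,
            feature_selector=args.feature_selector,
            feature_filter=args.feature_filter)
    image.save(args.output)
    return coords
```

Keeping the lookup of page and segment out of `extract` is a deliberate simplification: resolving `--element-id` to a PAGE object is a separate concern from the image cascade itself.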

kba commented 3 years ago

@bertsky Considering that ocrd_olena and ocrd_fileformat are the only two notable bashlib processors, that you implemented ocrd-segment-extract-*, and that polygons would still have to be serialized as images (making this a coarse approximation too), is there still merit in such a CLI? If yes, let's agree on the interface and implement it. If not, let's close.

bertsky commented 3 years ago

There's also ocrd_im6convert. And I don't think we should restrict bashlib in any way just because there are no more processors using it right now. In fact, I think it's precisely because bashlib is still too impoverished (so you have to go to such lengths via xmlstarlet as described above, and recently even the input_files iterator problem came on top) that it has not been used more.

It's true we don't really need image extraction as such – ocrd-segment-extract-* already do that.

What we do need here is an API that allows getting a (polygon-masked, via alpha channels and/or bg fill) segment image via the same ad-hoc creation or AlternativeImage retrieval algorithm in the Python API, including filters and selectors. And more than that, output all information necessary for coordinate transformations, too.

bertsky commented 2 years ago

What we do need here is an API that allows getting a (polygon-masked, via alpha channels and/or bg fill) segment image via the same ad-hoc creation or AlternativeImage retrieval algorithm in the Python API, including filters and selectors. And more than that, output all information necessary for coordinate transformations, too.

Just to make this clearer: the difficulty here is in making the API calls (usually cascaded like image_from_page → image_from_segment → image_from_segment) re-entrant on the shell. For large in-memory objects like images, we can probably use (temporary) files. So, wrapping the file ID and segment ID to read and the image file name to write is no problem, and neither are the extra parameters. Even the parent image (file name) would be doable, but (returning and passing) the parent coords is hard, because it would have to be a single (reusable) string that serializes all the information (transform array, angle float, features string).
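One conceivable way to get such a single string, sketched here under assumptions: pack the three fields into JSON and base64-encode the result, so the token contains no whitespace or quoting characters and can be passed around as one shell word. The function names `coords_to_token` and `token_to_coords` are hypothetical, and the transform is handled as nested lists of floats (a NumPy array would need converting back on the reading side).

```python
import base64
import json


def coords_to_token(coords):
    """Serialize transform, angle and features into one shell-safe string."""
    payload = {
        # Iterating row by row works for nested lists and for 2-D arrays.
        "transform": [[float(v) for v in row] for row in coords["transform"]],
        "angle": float(coords["angle"]),
        "features": str(coords["features"]),
    }
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def token_to_coords(token):
    """Inverse of coords_to_token; returns a plain dict with nested lists."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))
```

A shell caller would then capture the token printed by one call and hand it unchanged to the next call in the cascade. The price is that the token is opaque: it cannot be inspected or edited with ordinary shell tools without decoding it first.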