Open kba opened 5 years ago
But we first need the extraction Python API in core. Currently, it is only in common of OCR-D/ocrd_tesserocr, right? (related: OCR-D/ocrd_tesserocr#56)
Also, bashlib needs to offer more than just extraction: we also have to create PAGE-XML output with Olena binarization (referencing the new file in AlternativeImage and in the METS). So we have to shell-wrap the generateDS data model.
Is that possible with Click by any chance? Or would it have to be a general OOP extension? Maybe something like Skull, which can already do general XML deserialization like so, but the current implementation is a mess (it's just sed expressions!)
> Is that possible with Click by any chance? Or would it have to be a general OOP extension?
I'd much rather make the Python CLI more flexible than use a shell framework. Not that I don't love those, but the cost/reward ratio is off at this point.
> So we have to shell-wrap the generateDS data model.
That is also very ambitious. It's probably doable to generate a generic CLI frontend to generateDS code, but it would take a lot of effort. It is probably substantially easier to offer a limited set of options, like those of the workspace subcommand.
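As a rough illustration of the "limited set of options" idea, here is a minimal sketch of a small, fixed command surface. It uses the standard-library argparse as a stand-in for Click, and the program name page-tool, the subcommands list-images and add-image, and the --comments option are all hypothetical; none of them exist in OCR-D core.

```python
import argparse


def build_parser():
    """Build a small, fixed CLI instead of wrapping the whole data model."""
    # "page-tool" and both subcommands are hypothetical names for illustration.
    parser = argparse.ArgumentParser(prog="page-tool")
    sub = parser.add_subparsers(dest="command", required=True)

    # Hypothetical: list the AlternativeImage references of a PAGE file
    ls = sub.add_parser("list-images")
    ls.add_argument("page_file")

    # Hypothetical: add one AlternativeImage reference to a PAGE file
    add = sub.add_parser("add-image")
    add.add_argument("page_file")
    add.add_argument("image_file")
    add.add_argument("--comments", default="")
    return parser


# Parse an example command line as a shell script would pass it
args = build_parser().parse_args(
    ["add-image", "page.xml", "bin.png", "--comments", "binarized"]
)
```

The point of the sketch is only that a handful of explicit subcommands is cheap to write and to call from a shell script, whereas a generic frontend would need to mirror every generated class.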
I fully agree.
FYI, for Olena I am in the middle of a PR that will allow querying from and appending to PAGE's AlternativeImage with xmlstarlet, tentatively only on the Page level, but the other hierarchy levels are doable. The main obstacle was/is the namespace problem:
If this is successful, it could easily be generalized to core.
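For comparison, the same query-and-append operation on the Page level can be sketched in Python with the standard library, where the namespace has to be handled explicitly as well. This is only an illustration, not the xmlstarlet code from the PR: the namespace URI assumes the 2019-07-15 PAGE version, the sample document is made up, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15; other PAGE versions use a different URI.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}
ET.register_namespace("", PAGE_NS)  # serialize without an "ns0:" prefix

# Made-up minimal PAGE document for illustration
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="orig.tif" imageWidth="100" imageHeight="50">
    <AlternativeImage filename="crop.png" comments="cropped"/>
  </Page>
</PcGts>"""


def list_alternative_images(root):
    """Return the filenames of all Page-level AlternativeImage elements."""
    return [el.get("filename")
            for el in root.findall("pc:Page/pc:AlternativeImage", NS)]


def append_alternative_image(root, filename, comments):
    """Add a Page-level AlternativeImage after the existing ones."""
    page = root.find("pc:Page", NS)
    existing = page.findall("pc:AlternativeImage", NS)
    new = ET.Element(f"{{{PAGE_NS}}}AlternativeImage",
                     {"filename": filename, "comments": comments})
    # Assumes AlternativeImage elements come first among Page's children
    page.insert(len(existing), new)
    return new


root = ET.fromstring(SAMPLE)
append_alternative_image(root, "bin.png", "cropped,binarized")
filenames = list_alternative_images(root)
serialized = ET.tostring(root, encoding="unicode")
```

Without the explicit prefix mapping in NS, the path expressions would match nothing, which is the same kind of pitfall a namespace-unaware query runs into on the shell.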
@kba, see bashlib-related FIXMEs in OCR-D/ocrd_olena#5
@kba @bertsky can this be closed?
> can this be closed?
I don't think so. We should at least re-visit.
What we have now as a proof of concept in ocrd-olena-binarize and ocrd-im6convert is based on the xmlstarlet CLI, essentially a (partial) re-implementation. This is hard to modify/use/transfer, even if it were integrated into bashlib. (I guess any shell code is hard to make into a good API without non-trivial data structures.)
If we had a generic image processor in Python, we could probably reduce the need for that shell API greatly.
> If we had a generic image processor in Python, we could probably reduce the need for that shell API greatly.
We do have that (as part of ocrd_wrap's ocrd-preprocess-image), but I do think that it could be worthwhile in itself to provide the exact interface @kba proposed in the opening comment (as a shell layer around Workspace.image_from_page and Workspace.image_from_segment). At the moment bashlib processors only partially address this part (by cropping coarsely with IM). So it's actually complementary to the PAGE-XML / xmlstarlet aspect we discussed in between.
@bertsky Considering that ocrd_olena and ocrd_fileformat are the only two notable bashlib processors, that you implemented ocrd-segment-extract-*, and that polygons would still have to be serialized as images (making this also a coarse approximation), is there still merit in such a CLI? If yes, let's agree on the interface and implement it. If not, let's close.
There's also ocrd_im6convert. And I don't think we should restrict bashlib in any way just because there are no more processors using it right now. In fact, I think it's precisely because bashlib is still too impoverished (so you have to go to such lengths via xmlstarlet as described above, and recently even the input_files iterator problem came on top) that it has not been used more.
It's true we don't really need image extraction as such: ocrd-segment-extract-* already do that.
What we do need here is an API that allows getting a (polygon-masked, via alpha channels and/or bg fill) segment image via the same ad-hoc creation or AlternativeImage retrieval algorithm in the Python API, including filters and selectors. And more than that, output all information necessary for coordinate transformations, too.
> What we do need here is an API that allows getting a (polygon-masked, via alpha channels and/or bg fill) segment image via the same ad-hoc creation or AlternativeImage retrieval algorithm in the Python API, including filters and selectors. And more than that, output all information necessary for coordinate transformations, too.
Just to make this clearer: the difficulty here is in making the API calls (usually cascaded like image_from_page → image_from_segment → image_from_segment) re-entrant on the shell. For large in-memory objects like images, we can probably use (temporary) files. So wrapping the file ID and segment ID to read and the image file name to write is no problem, nor are the extra parameters. Even the parent image (file name) would be doable, but returning and passing the parent coords is hard, because they would have to be a single (reusable) string that serializes all the information (transform array, angle float, features string).
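One conceivable way to get such a single reusable string is to pack the coords into JSON and encode that as URL-safe Base64, so the result contains no whitespace or quoting characters and can be passed around as one shell word. The following is a minimal sketch under that assumption, not an existing bashlib or core feature; the function names are hypothetical, and the transform is given as nested lists (a NumPy array would first need .tolist()).

```python
import base64
import json


def coords_to_token(coords):
    """Pack a coords dict into one whitespace-free, shell-safe string."""
    payload = json.dumps(coords, separators=(",", ":"), sort_keys=True)
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def coords_from_token(token):
    """Restore the coords dict from a token made by coords_to_token."""
    payload = base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
    return json.loads(payload)


# Example values are made up; the keys follow the three items named above.
page_coords = {
    "transform": [[1.0, 0.0, -120.0], [0.0, 1.0, -80.0], [0.0, 0.0, 1.0]],
    "angle": 0.5,
    "features": "cropped,deskewed",
}
token = coords_to_token(page_coords)
restored = coords_from_token(token)
```

A shell caller would then only need to capture the token from one call's output and hand it unchanged to the next call in the cascade, without ever parsing it.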
For implementations using the bashlib API, it would be useful to have a command to extract a specific block/line from a PAGE, e.g.
c.f. https://github.com/OCR-D/ocrd_olena/pull/2#discussion_r304248793