OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Respect alternative image (if present) #33

Closed wrznr closed 5 years ago

wrznr commented 5 years ago

According to the OCR-D functional model, binarization can take place prior to block and line segmentation. Both processing steps should use the alternative image (if present).

bertsky commented 5 years ago

Indeed. This also applies to later processing steps, i.e. image manipulation steps operating on regions and lines. The only way to reference these additional image files is via AlternativeImage from some PAGE. So all later steps must query these first, and use them instead of calling SetRectangle on the element's coordinates: not just Crop/SegmentRegion on PageType, but also SegmentLine on TextRegionType, SegmentWord on TextLineType, and Recognize on its textequiv_level element.

If there is ambiguity (multiple alternative images available), maybe we should define rules for choosing among them? We already have rules for @comments. Now we have to specify which @comments values are preferable/expected at which step.

bertsky commented 5 years ago

@wrznr and I have given this some thought:

There are preprocessing steps that must create new image data (because there is no other way to represent their result), like despeckling, dewarping and binarization. There are also steps that can create new image data, but could instead just annotate the PAGE with enough information for later steps to apply them, e.g. deskewing (via @orientation) and cropping (via Coords/@points). And sometimes that depends on the hierarchy level: e.g. the deskewing angle can only be annotated on TextRegion (as TextLine and Page have no @orientation).

But whatever the level, when descending to a lower level, all the annotated image preprocessing should be applied, because otherwise it would have to be repeated in all the constituent elements during the next step.

Therefore, while generally it is for the processor to decide whether or not to create new image data, at the last step per level (typically binarization) it must be configured to do so. And every processor must be programmed to respect image data (AlternativeImage) for its respective level (or higher in the hierarchy) if referenced in the input PAGE. Since each step produces a new PAGE from the old one, there is no (valid use-case of) ambiguity – one can always take the last AlternativeImage (and the @comments are purely cosmetic).
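To illustrate the "take the last AlternativeImage at this level or higher" rule, here is a minimal sketch using only the Python standard library. The helper names (last_alternative_image, resolve_image), the sample file names and the choice of the 2019-07-15 PAGE namespace are assumptions made for this example; this is not existing OCR-D API.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE namespace; adjust to the version in use.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def last_alternative_image(element):
    """Return the filename of the last AlternativeImage child, or None."""
    images = element.findall(f"{{{NS}}}AlternativeImage")
    return images[-1].get("filename") if images else None


def resolve_image(path):
    """Pick the image to use for the lowest element in `path`.

    `path` lists elements from the lowest level (e.g. TextLine) up to Page.
    The first level that references an AlternativeImage wins; if none does,
    fall back to the original Page/@imageFilename.
    """
    for element in path:
        filename = last_alternative_image(element)
        if filename is not None:
            return element, filename
    page = path[-1]
    return page, page.get("imageFilename")


# Hypothetical sample annotation, only for demonstration.
SAMPLE = f"""<PcGts xmlns="{NS}">
  <Page imageFilename="orig.tif" imageWidth="2000" imageHeight="3000">
    <AlternativeImage filename="page_bin.png" comments="binarized"/>
    <TextRegion id="r1">
      <AlternativeImage filename="r1_bin.png" comments="binarized"/>
      <AlternativeImage filename="r1_bin_deskewed.png" comments="binarized,deskewed"/>
      <Coords points="100,100 500,100 500,400 100,400"/>
      <TextLine id="l1"><Coords points="110,110 490,110 490,150 110,150"/></TextLine>
    </TextRegion>
    <TextRegion id="r2">
      <Coords points="100,500 500,500 500,800 100,800"/>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
page = root.find(f"{{{NS}}}Page")
region1, region2 = page.findall(f"{{{NS}}}TextRegion")
line1 = region1.find(f"{{{NS}}}TextLine")
```

Here resolve_image([line1, region1, page]) picks the region's last image, whereas resolve_image([region2, page]) falls through to the page-level one. Note that the returned element matters as much as the filename: it tells the caller at which level the coordinate compensation discussed below has to start.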

So PAGE+METS allows a very flexible generic workflow design. However, there is a subtlety in coordinate calculations involved here: Since PointsType (anywhere from BorderType down to any segment's CoordsType) is required to be relative to the root PageType/@imageFilename image, but AlternativeImage generally does not retain coordinates, one cannot simply derive lower-level image data for an element by cutting the parent image in the hierarchy at its coordinates. Instead, each implicit coordinate transform must be explicitly passed down along with AlternativeImage so it can be compensated in lower-level coordinate calculations.

And obviously, this would be difficult to do (and even more difficult to annotate) with non-linear transforms like dewarping. It is easier to live with that if dewarping is done on the line level (when only vertical coordinates will be off for words and glyphs) than on the page level.

But for linear transforms this can be done easily:

  1. when cropping, calculate the offset of the lower-level segments
  2. (if the AlternativeImage is larger than annotated, as happens during deskewing/rotation because the image has to be expanded/reshaped, then decrease the offset x / y by half the difference in width / height)
  3. when deskewing: also rotate the (polygon) coordinates, but by passive rotation (inverse transform), and compensate for the fact that image rotation is centered in the image (hence for coordinates, center translation, pure rotation, back-translation)
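The three steps above can be sketched as follows. This is an illustration only, not the ocrd_cis code: the function names, the bbox tuple layout and the sign convention (angle_deg is the angle by which the cropped image was rotated counter-clockwise as displayed, about its center, with the canvas possibly expanded) are assumptions for this example. A processor using a different rotation convention would have to flip the sign.

```python
import math


def page_to_image(points, bbox, image_size, angle_deg=0.0):
    """Map page coordinates into a cropped (and possibly rotated) image.

    points:     iterable of (x, y) relative to Page/@imageFilename
    bbox:       (x0, y0, x1, y1) of the element the image was cropped to
    image_size: (width, height) of the AlternativeImage as stored
    angle_deg:  assumed rotation applied to the crop (CCW as displayed)
    """
    x0, y0, x1, y1 = bbox
    w, h = x1 - x0, y1 - y0
    new_w, new_h = image_size
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    result = []
    for x, y in points:
        # steps 1 and 3a: crop offset, then translate to the crop's center
        u = x - x0 - w / 2
        v = y - y0 - h / 2
        # step 3b: pure rotation (y axis points down in pixel coordinates)
        ur = u * cos_a + v * sin_a
        vr = -u * sin_a + v * cos_a
        # steps 2 and 3c: back-translate to the center of the stored image,
        # which absorbs half the size difference caused by expansion
        result.append((ur + new_w / 2, vr + new_h / 2))
    return result


def image_to_page(points, bbox, image_size, angle_deg=0.0):
    """Inverse of page_to_image, e.g. for annotating newly found segments."""
    x0, y0, x1, y1 = bbox
    w, h = x1 - x0, y1 - y0
    new_w, new_h = image_size
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    result = []
    for x, y in points:
        u = x - new_w / 2
        v = y - new_h / 2
        ur = u * cos_a - v * sin_a
        vr = u * sin_a + v * cos_a
        result.append((ur + x0 + w / 2, vr + y0 + h / 2))
    return result
```

For example, with bbox (100, 200, 300, 260) and an unrotated 200x60 image, the page point (110, 220) lands at (10, 20). If the stored image is 220x80 instead (padding only), the same point lands at (20, 30), i.e. the offset has been decreased by half the size difference, as in step 2.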

I have implemented this for ocropy first. Functions in ocrd_cis.ocropy.common like image_from_page, image_from_region, image_from_line and save_image_file should probably be moved into ocrd.workspace.Workspace and recommended for all processors. But before that I want to re-integrate this architecture here and see if the solution is general enough...

bertsky commented 5 years ago

@kba @chreul What do you think?

With permission from @wrznr I add this general workflow diagram for illustration of preprocessing options.

bertsky commented 5 years ago

principle

So, to rephrase the "subtlety": we have a principle at work here which states that coordinates within any AlternativeImage (on whatever level) must be reproducible, i.e. the annotation present in an element that contains AlternativeImage and upwards in the hierarchy must always be sufficient to calculate the pixel position in the image from the pixel position in PageType/@imageFilename (e.g. when cropping components further down the hierarchy) or vice versa (e.g. when adding elements further down the hierarchy).

problems

This reproducibility principle is currently jeopardized (in concept) by two problems:

  1. dewarping, especially on the page or region level (as mentioned above)
  2. rescaling: processors (annotation producers or consumers) might want to reference images that are upscaled or downscaled to fit the needs of their recognition model or binarization algorithm. We currently do not have this in the spec, but I think we really should.

dewarping

Now, as for 1, we could try to define a parametric field equivalent (within reasonable accuracy) to any conceivable binary dewarping transform. For example, let's assume the Leptonica approach has sufficient generality. It defines the transform as a vertical and horizontal disparity field, which is basically a (quadratic) parametric function of points interpolated between equidistant intervals. This can be described as two vectors each.

So all we need is an attribute in PAGE for this, and consumers willing to perform the compensatory calculations on all coordinates after and below dewarping. We could of course use @custom again.

Or could we perhaps use GridType for this?

rescaling

Regarding problem 2, we now face the problem that a difference between the actual (pixel) size of the AlternativeImage data and the size of the Coords/@points rectangle of the element can be caused by either rotation or rescaling, which calls for either offset correction or scaling back, respectively. So we need a way to disambiguate this. Using @comments to check which of the two is the case is not sufficient though: it could also be both! So again, we either resort to @custom, or we need a new attribute in PAGE, let's say AlternativeImage/@scale with xsd:attribute/@default="1.0". A third option would be to prohibit rescaling as a valid use case altogether.
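A small sketch of why the size difference alone is ambiguous, and how an explicit scale factor would resolve it. Everything here is hypothetical: the @scale attribute does not exist in PAGE, the function names are invented for this example, and the sketch assumes that rescaling is applied after cropping and rotation/expansion.

```python
def padding_from_sizes(bbox_size, image_size, scale=1.0):
    """Padding (in page pixels) added around the crop, given a known scale.

    bbox_size:  (width, height) of the Coords/@points bounding rectangle
    image_size: (width, height) of the AlternativeImage as stored
    scale:      hypothetical AlternativeImage/@scale (stored / original)
    """
    w, h = bbox_size
    img_w, img_h = image_size
    return ((img_w / scale - w) / 2, (img_h / scale - h) / 2)


def page_to_scaled_image(point, bbox_origin, padding, scale=1.0):
    """Map a page point into an unrotated, padded and rescaled image."""
    x, y = point
    x0, y0 = bbox_origin
    pad_x, pad_y = padding
    return ((x - x0 + pad_x) * scale, (y - y0 + pad_y) * scale)


# The same 400x120 image for a 200x60 rectangle admits two readings:
as_padding = padding_from_sizes((200, 60), (400, 120))             # scale 1.0
as_scaling = padding_from_sizes((200, 60), (400, 120), scale=2.0)  # scale 2.0
```

Read with the default scale, the image appears to carry 100 and 30 pixels of padding on each side; read with scale 2.0, it carries none. Without the scale being annotated, a consumer cannot tell which offsets to apply.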

@chris1010010 what do you think? (BTW, introducing @orientation on the page level can also be seen as a means to ensure the AlternativeImage reproducibility principle.)

bertsky commented 5 years ago

BTW, @cneud how does ALTO deal with this? Is ComposedBlockType the equivalent of AlternativeImage, or how do you specify binary image data (interim) results below the page level?

bertsky commented 5 years ago

Oh, while we are at it: there are two more points which might need disambiguation:

  A. If a region (or page) has non-zero @orientation and an AlternativeImage is present, do we expect the image to be deskewed already, or do processors always produce a @comments string with the correct classification, i.e. including deskewed?
  B. If a page has Border and an AlternativeImage is present, do we expect the image to be cropped already, or do processors always produce a @comments string with the correct classification, i.e. including cropped?