artunit closed this issue 4 years ago
That is actually extremely pertinent to my work right now. For basic manuscripts with completely straight, vertical or horizontal writing, ALTO works quite well, but anything more complex would benefit from a free-form baseline capability. hOCR limits the baseline definition to a polynomial, but a sequence of line segments is more appropriate for highly curled or circular lines.
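To illustrate why a sequence of line segments is the more general form, here is a minimal sketch that samples a polynomial baseline into a polyline. The function name `polynomial_to_polyline` and the coefficient order (constant term first) are assumptions made for this example, not part of hOCR or ALTO. The reverse conversion is not possible in general, since a circular line cannot be written as a single function y = f(x).

```python
def polynomial_to_polyline(coeffs, x_start, x_end, segments=8):
    """Sample y = c0 + c1*x + c2*x**2 + ... into a list of (x, y) points.

    coeffs is ordered constant term first (an assumption for this sketch).
    """
    if segments < 1:
        raise ValueError("segments must be at least 1")
    step = (x_end - x_start) / segments
    points = []
    for i in range(segments + 1):
        x = x_start + i * step
        # Evaluate the polynomial at x.
        y = sum(c * x ** k for k, c in enumerate(coeffs))
        points.append((x, y))
    return points


if __name__ == "__main__":
    # A slightly rising straight baseline, approximated with two segments.
    print(polynomial_to_polyline([400, 0.5], 0, 100, segments=2))
```

More segments give a closer approximation of a curved baseline, at the cost of a longer point list.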
The shape-element usage discussion might be useful to you. I used the bounding box coordinates from the Cloud Vision API, but ALTO has allowed polygon, circle and ellipse shape types since version 3.1, and these are available down to the glyph level.
Stupid question: Does the POLYGON shape define an open or a closed polygon? For baselines, open would be more appropriate, but the documentation doesn't elaborate on that point.
@mittagessen an "open polygon" is an oxymoron: a polygon is by definition "a closed plane figure bounded by three or more line segments." If what is meant is a series of points connected by line segments, maybe the name should be changed (not that I have an elegant suggestion).
@urieli Open polygonal chains are sometimes known as open polygons. The shortest unambiguous name would be polyline.
The easiest way to deal with this rather special case would be to extend the BASELINE attribute to allow polylines instead of a single line segment. It would also keep the existing semantics of the shape elements.
@urieli, @mittagessen - I like the BASELINE suggestion. Technically, the schema doesn't distinguish between open and closed polygons, though the documentation does identify its use for bounding shapes. Issue 22 targets changing BASELINE to PointsType, which I think would address this.
@artunit Changing BASELINE to a points type is exactly what I had in mind, although I am unsure whether the change breaks backward compatibility unnecessarily. The old model just used a single y-coordinate, so the encoding differs even for perfectly straight baselines.
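One way a reader could bridge the two encodings is sketched below. It assumes the line's HPOS and WIDTH values are available to turn a legacy single y-coordinate into a two-point baseline, and that a points value is a flat, whitespace-separated list of numbers as in the examples in this thread. The function name `normalize_baseline` is hypothetical, not something the schema defines.

```python
def normalize_baseline(value, hpos, width):
    """Return a baseline as a list of (x, y) points.

    value is the raw BASELINE attribute string: either a single y-coordinate
    (the old model) or a flat list "x1 y1 x2 y2 ..." (assumed points form).
    hpos and width describe the horizontal extent of the text line.
    """
    numbers = [float(token) for token in value.split()]
    if len(numbers) == 1:
        # Legacy form: a horizontal baseline spanning the whole line.
        y = numbers[0]
        return [(float(hpos), y), (float(hpos) + float(width), y)]
    if not numbers or len(numbers) % 2 != 0:
        raise ValueError("expected a single y value or an even count of coordinates")
    # Points form: pair up consecutive x and y values.
    return list(zip(numbers[0::2], numbers[1::2]))


if __name__ == "__main__":
    print(normalize_baseline("420", hpos=100, width=300))
    print(normalize_baseline("200 400 203 405 210 420", hpos=200, width=10))
```

A writer that only emits the points form would still break consumers that expect a single number, so this only addresses the reading side.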
@mittagessen The schema does not currently annotate BASELINE, and I guess it would come down to whether existing implementations would be broken. A point is normally two coordinates, though there could be the notion that one is implicit for single values in the annotation. The schema also has the notion of a typesetting point, or 1/72 of an inch, so it would probably be good to define the different uses of point. In the same vein, PointsType is defined as a list of points, and I think it would be useful to allow these to be written as a list of pairs, e.g. instead of 200 400 203 405 210 420, something like (200,400),(203,405),(210,420).
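As a rough illustration of accepting both notations, the sketch below pulls every number out of the string and pairs them up, so the flat form and the parenthesised form parse to the same result. The helper names `parse_points` and `format_pairs` are hypothetical, and the pair notation is the suggestion from this comment, not something the schema is known to define.

```python
import re

# Matches integers and simple decimals, optionally negative.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_points(text):
    """Parse "x1 y1 x2 y2 ..." or "(x1,y1),(x2,y2),..." into (x, y) tuples."""
    numbers = [float(n) for n in _NUMBER.findall(text)]
    if not numbers or len(numbers) % 2 != 0:
        raise ValueError("expected a non-empty, even count of coordinates")
    return list(zip(numbers[0::2], numbers[1::2]))


def format_pairs(points):
    """Write points in the suggested pair notation."""
    return ",".join(f"({x:g},{y:g})" for x, y in points)


if __name__ == "__main__":
    flat = parse_points("200 400 203 405 210 420")
    paired = parse_points("(200,400),(203,405),(210,420)")
    print(flat == paired)
    print(format_pairs(flat))
```

This parser is deliberately lenient: it ignores the separators entirely, so it would not reject malformed input such as unbalanced parentheses.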
This issue seems to be addressed: ALTO is now used for encoding handwriting in two major projects (Transkribus and eScripta), and the change to BASELINE has been published in version 4.2 of the ALTO schema.
ALTO could have great value for handwriting representation. This is an initial example of what it might look like. I have taken the coordinates and confidence levels from the Cloud Vision API and its beta support for handwriting recognition, though I have rounded the Glyph confidence numbers.