altoxml / schema

ALTO XML schema - latest and all former versions

Change BASELINE to accommodate a list of points in addition to a single point #32

Closed jpmoreux closed 4 years ago

jpmoreux commented 9 years ago

Günter Mühlberger and Structify colleagues at University of Innsbruck would like to request using a list of points for the BASELINE instead of one single point. So changing from

<xsd:attribute name="BASELINE" type="xsd:float" use="optional"/>

to

<xsd:attribute name="BASELINE" type="PointsType" use="optional"/>

Moreover, for handwritten text it could be useful to have more than one BASELINE for a single text line, e.g. when text was crossed out and overwritten.

The first marked text below shows a line with logically two baselines. The word above the line belongs logically to the same line. This is why we would like to have several baselines for one line.

The marked text number 2 shows why a baseline realised as a polyline is so important when dealing with handwritten or distorted text.

[Image: altobaselinerequest]

artunit commented 7 years ago

Could BASELINE use SHAPE in the same way as STRING, etc., with the addition of POLYLINE to ShapeType?

artunit commented 6 years ago

I realised that my comment sort of skipped a major implication, that BASELINE becomes an element instead of an attribute. As I tried working through this more by hand, I see the relationship between TEXTLINE, BASELINE and STRING much better. If we have something like:

That would allow multiple BASELINES and capture the geometry of the line(s). If someone was calculating typographic/writing constructs like "descenders", it might be important to capture the thickness of the baseline; that's where the SHAPE question came in, but it's probably not necessary in the absence of an actual request.

artunit commented 6 years ago

Just a quick update on this, I think we should go with the original request for now, i.e., use:

It is true that there are many occasions where handwritten materials have multiple lines, but it gets really dicey when the second line stops half-way through a character and I am wondering if it's really a BASELINE at that point. I suspect that the "descender" aspect might come forward at some point but maybe we can avoid the weeds on some of this by starting with the smallest change and working from there.

Jo-CCS commented 6 years ago

I doubt that the first sample, marked "Ex. 1", is a good example, as I wonder whether it should instead be described as a separate "TextLine" object. In any case, the second example shows clearly that the line is not exactly horizontal, so it needs to be possible to describe curved or sloped lines.

Furthermore, I would like to point out that an annotation is completely missing for this attribute; one is needed to spell out the exact intended usage and value. In the other sample from Jean-Philip uploaded here, it was filled with the distance to the top of the page. I would expect that for the PointsType it should be kept like this, i.e. absolute coordinates relative to the top-left corner of the page, like the other coordinates, and not values relative to the TextLine object.

artunit commented 6 years ago

Sorry to be so slow on this, Jo, I didn’t get pinged by email when your comment was added and missed this until now. In typesetting, the baseline is the line upon which a line of text rests. The concept carries over to handwriting but it can blur on inspection with things like underlining for emphasis (which is what I think is happening in Ex. 2, where the baseline intersects with the underline). I think it's technically not always possible to determine what is baseline in handwriting but the coordinates can define a line that seems to fulfill this function. To make this explicit:

<xsd:attribute name="BASELINE" type="PointsType" use="optional">
   <xsd:annotation>
      <xsd:documentation>A single line on which a line of text rests.</xsd:documentation>
   </xsd:annotation>
</xsd:attribute>

artunit commented 6 years ago

As per the Mar. 12, 2018 meeting, we use the typographic interpretation of BASELINE and define the coordinate orientation:

<xsd:attribute name="BASELINE" type="PointsType" use="optional">
   <xsd:annotation>
      <xsd:documentation>
         Pixel coordinates based on the left-hand top corner of an image 
         which define a single line on which a line of text rests.
      </xsd:documentation>
   </xsd:annotation>
</xsd:attribute>

urieli commented 6 years ago

It's not clear to me which version of the schema is being discussed here. In version 4.0, we have only: https://github.com/altoxml/schema/blob/master/v4/alto-4-0.xsd#L906

However, there is no documentation to indicate what this float value represents: is it the vertical coordinate of the baseline at the TextLine's HPOS? Presumably, the containing TextBlock's ROTATION then makes it possible to deduce a line spanning the entire TextLine.

Changing this to a PointsType should make it possible to define a line explicitly (although PointsType documentation doesn't tell us how to encode the points as a string).

Note: I can easily imagine a baseline with more than two points, for book pages that were not scanned "flat", so that the text curves upwards near the inner margin.
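To make the curved-page case concrete, here is a minimal Python sketch of how a consumer might read a multi-point baseline, assuming the points are already available as (x, y) pixel pairs sorted by increasing x. The function name `baseline_y_at` and the sample coordinates are hypothetical and purely illustrative; nothing here is defined by the ALTO schema.

```python
def baseline_y_at(points, x):
    """Linearly interpolate the baseline's y value at horizontal position x.

    `points` is a list of (x, y) tuples sorted by increasing x (an
    assumption of this sketch). Returns None if x lies outside the
    horizontal extent of the polyline.
    """
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            if x1 == x0:
                # Vertical segment: no interpolation possible, use its start.
                return y0
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return None


# Hypothetical baseline that is flat at first, then curves upwards
# (smaller y, since coordinates are measured from the top-left corner).
curved = [(100, 500), (400, 500), (600, 490), (700, 470)]
print(baseline_y_at(curved, 500))
```

A two-point baseline is just the special case of a single segment, so the same routine would cover both straight and curved lines.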

artunit commented 6 years ago

The syntax from the comment on Mar. 20 is the latest attempt to address this issue and is not part of a schema yet. The syntax has to be agreed to by the ALTO Board before becoming part of a version, and is still open to discussion. We are moving towards using PointsType, which is used for several SHAPES now. The syntax I have seen is the x coordinate followed by the y coordinate, e.g. 200 400 203 405 210 420, where something like [[200, 400], [203, 405], [210, 420]] or maybe (200,400),(203,405),(210,420) would be more explicit. This might be worth pursuing as a separate issue.
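As a rough illustration of the flat syntax mentioned in this comment, the following Python sketch splits a space-separated string into (x, y) pairs. The function name `parse_points` is hypothetical, and the assumption that the values are whitespace-separated numbers in x-then-y order comes only from the example in the comment, not from any documented PointsType encoding.

```python
def parse_points(value):
    """Parse a flat 'x1 y1 x2 y2 ...' string into a list of (x, y) tuples.

    Assumes whitespace-separated numbers in x-then-y order, as in the
    example from the thread; the schema does not document the encoding.
    """
    numbers = [float(token) for token in value.split()]
    if len(numbers) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    # Pair up consecutive values: even indices are x, odd indices are y.
    return list(zip(numbers[0::2], numbers[1::2]))


points = parse_points("200 400 203 405 210 420")
print(points)
```

With the sample string from the comment, this yields the three points (200, 400), (203, 405) and (210, 420) as floats. A string with an odd number of values is rejected, which is exactly the kind of ambiguity a more explicit pair syntax would avoid.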

artunit commented 6 years ago

Gah, I didn't mean to close this, reopened and I need more coffee!

mittagessen commented 5 years ago

Is there a way to get this in the pipeline for inclusion in the next ALTO version?

artunit commented 5 years ago

@mittagessen - I think so, it was generally agreed to at the Board level.

artunit commented 5 years ago

As per the 2019-09-27 Board Meeting, the proposed attribute will be changed to use a polyline. This also means the issue is available for voting ACCEPT or REJECT.

<xsd:attribute name="BASELINE" type="PointsType" use="optional">
   <xsd:annotation>
      <xsd:documentation>
         Pixel coordinates based on the left-hand top corner of an image 
         which define a polyline on which a line of text rests.
      </xsd:documentation>
   </xsd:annotation>
</xsd:attribute>

artunit commented 5 years ago

ACCEPT

cowboyMontana commented 4 years ago

ACCEPT

splet commented 4 years ago

ACCEPT

cneud commented 4 years ago

ACCEPT

ntra00 commented 4 years ago

ACCEPT

jukervin commented 4 years ago

ACCEPT

cipriandinu commented 4 years ago

ACCEPT

hanyelsawy commented 4 years ago

ACCEPT

Ra1phM commented 4 years ago

ACCEPT

jpmoreux commented 4 years ago

ACCEPT

artunit commented 4 years ago

Slated for 4.2, will close when published.

artunit commented 4 years ago

Added in v4.2, released August 2020.