altoxml / schema

ALTO XML schema - latest and all former versions

Position for rotated text #59

Closed silviu22 closed 5 years ago

silviu22 commented 5 years ago

I am a little confused about the coordinates and width/height of rotated text.

I believe HPOS,VPOS are the (x,y) coordinates of the top-left corner of the text block. Also, WIDTH/HEIGHT seem to be the width and height of the bounding rectangle containing the whole text.

This seems obvious for normal (horizontal) text. Is this the case for rotated text as well?

For example, when text is rotated by 90 degrees, the old width becomes the height and the old height becomes the width. To describe the question a little better, I came up with 4 cases:

Please take a look at this file: Text Position.pdf. In that file, (x,y) is the HPOS,VPOS of a particular word, and W/H is the width/height of that word.

I believe the answers are as follows:

Note that if HPOS,VPOS is always the top-left corner of the displayed text, then it has a different meaning for the program that is supposed to display such text.

Case D might be best to explain what I mean. To draw text at 45 degrees, you will typically tell the computer to draw text at 45 degrees starting from point P2 (the baseline). You will usually not tell it to display the text at point P4 (top-left corner). So I would have to do quite a lot of work to deduce point P2 that would draw text at 45 degrees that will have the top-left corner at point P4. It can be done, but it takes some work.

So, to recap, can someone confirm that the HPOS,VPOS for case D (text at 45 degrees) is point P4? Also, for case D (45 degrees), there are two possible pairs of values that can be considered width/height:

artunit commented 5 years ago

This is a good question and I am hoping one of my Board brethren with more experience with using ALTO with rotated blocks can weigh in. I believe that HPOS,VPOS are for the center of the block and then the rotation is applied when the rotation attribute is used.

bertsky commented 5 years ago

Please allow me to weigh in. I have just finished tackling those issues for PAGE-XML within OCR-D, where pc:RegionType/@orientation is a perfect analogue of @ROTATION, and all segments' @points are likewise absolute (always referencing the pixel xy coordinates of the page image source). See here for an in-depth discussion, and its implementations for Tesseract and for Ocropy.

(That discussion aims to solve not just the particular problem of rotation but the wider issues of relative coordinates when using binary image data for segments at each step in the hierarchy – blocks, lines, words –, which is possible to represent in PAGE-XML via pc:AlternativeImage/@filename. But that should be the same for ALTO-XML with its ComposedBlockType/@FILEID.)

Let me start off my answer with a quote from the spec. In the xsd:documentation of @ROTATION (my emphasis, this is also mentioned specifically in the changelog), we have:

Tells the rotation of e.g. text or illustration within the block.

So this merely informs about the skew of the binary image data within the annotated region (being described by the bounding box with @HPOS / @VPOS / @HEIGHT / @WIDTH or by a polygon with Shape/POINTS). Naturally, the bbox will have to be larger than the actual block's outline if it is rotated. Using a polygon would always be a more precise alternative representation.
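As a rough illustration of why the bounding box has to grow, the sketch below computes the axis-aligned box that encloses a w×h rectangle rotated about its center. The function name `rotated_bbox_size` and the sample dimensions are hypothetical and not part of the ALTO schema.

```python
import math

def rotated_bbox_size(w, h, angle_deg):
    """Return (W, H) of the axis-aligned box enclosing a w x h rectangle
    rotated by angle_deg about its center.

    The direction of rotation does not affect the size, so absolute
    values of the sine and cosine are used.
    """
    t = math.radians(angle_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    return w * c + h * s, w * s + h * c

# Hypothetical 100 x 20 word at three different angles
for angle in (0, 45, 90):
    print(angle, rotated_bbox_size(100, 20, angle))
```

At 0 degrees the box equals the rectangle, at 90 degrees width and height swap (up to floating-point rounding), and at 45 degrees both sides become (w+h)/√2, so the box is square and taller than the unrotated word.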

Therefore, yes @silviu22, your block D has its HPOS/VPOS at P4 and its WIDTH/HEIGHT is W2/H2 (referring to your drawing – your verbalization is somewhat unfortunate, because it describes W2/H2 as the width/height of the smallest rectangle of the rotated text; surely, you are referring to "rotated" as rotated in the image, but that's usually called "skewed", whereas "rotated" is the respective countermeasure).

This is not a big deal. It only starts to get complicated when we extract and annotate a binary, cropped and deskewed image for the block (via @FILEID), and offer this to the next lower segmentation: now the runtime coordinate system is relative to that (skewed) block, and has to be converted back to absolute before writing the segmentation results. To do that, coordinates have to be rotated back, and shifted by the offset of the parent block.

It is in this detail that the comment by @artunit makes some sense, but happens to be wrong:

"I believe that HPOS,VPOS are for the center of the block and then the rotation is applied when the rotation attribute is used."

No, HPOS / VPOS are always the top-left corner of blocks. But indeed, deskewing does rotate around the center of the region, at x = HPOS+0.5*WIDTH, y = VPOS+0.5*HEIGHT. That just means the above-mentioned compensatory (passive) rotation of coordinates has to be accompanied by a translation (from the top-left corner to the center) before the rotation and a back-translation (from the center to the top-left corner) afterwards.
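The translate, rotate, translate-back sequence can be sketched as follows. This is a minimal sketch under several assumptions: the deskewed block image has the same size as the bounding box (no canvas expansion), the y axis points down as in image coordinates, and the deskewing angle is given as counterclockwise degrees as seen on screen. The function name `block_to_page` and its parameters are hypothetical and not taken from OCR-D or the ALTO schema.

```python
import math

def block_to_page(x, y, hpos, vpos, width, height, deskew_ccw_deg):
    """Map a point (x, y) in the deskewed block image back to absolute
    page coordinates.

    Assumes the block image was rotated counterclockwise (on screen) by
    deskew_ccw_deg around its center, so the rotation is undone by
    rotating clockwise by the same angle.
    """
    cx, cy = width / 2.0, height / 2.0
    # 1. translate: top-left origin -> center origin
    dx, dy = x - cx, y - cy
    # 2. rotate back (clockwise on screen, with y pointing down)
    t = math.radians(deskew_ccw_deg)
    rx = dx * math.cos(t) - dy * math.sin(t)
    ry = dx * math.sin(t) + dy * math.cos(t)
    # 3. translate back to the top-left origin and add the block offset
    return hpos + cx + rx, vpos + cy + ry

# The block center is a fixed point of the rotation
print(block_to_page(50, 50, 100, 200, 100, 100, 30))
```

With an angle of 0 the function reduces to adding the HPOS/VPOS offset, and the center of the block always maps to (HPOS+0.5*WIDTH, VPOS+0.5*HEIGHT) regardless of the angle. If a tool uses the opposite sign convention, negate the angle.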

@cneud as you can see I answered myself here.

artunit commented 5 years ago

Thanks @bertsky, @Jo-CCS pointed out my error at the last Board meeting. I was sure I had read this somewhere but the schema is indeed the definitive word on this.

silviu22 commented 5 years ago

Thank you @bertsky and @artunit for the clarification.

To me, skewed text was distorted text, like italic text. (I assumed skewed text is still written horizontally, but you drag the top of the text left or right by a certain amount, the same way the italic text leans to the right). But if you prefer the term "skewed" instead of "rotated", that is fine with me.

There will be a good amount of calculations to find a way to draw this skewed text using the coordinates of the bounding rectangle. But this is fine as long as it's clear.
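One part of that calculation can be sketched under simplifying assumptions: the text is an ideal rectangle rotated counterclockwise (as seen on screen) by an angle between 0 and 90 degrees, its bottom edge stands in for the baseline, and the unrotated text height is known, for example from the font size. The function name `baseline_start` and its parameters are hypothetical. Note that the bounding box alone is not enough at 45 degrees, where WIDTH equals HEIGHT and only the sum of text width and height can be recovered, which is why the text height is passed in.

```python
import math

def baseline_start(hpos, vpos, height, text_height, angle_ccw_deg):
    """Return the page position of the bottom-left corner of the rotated
    text rectangle, given its bounding box top-left (hpos, vpos), the
    bounding box height, and the unrotated text height.

    Assumes 0 <= angle_ccw_deg <= 90 and y pointing down. In that range
    the bottom-left text corner lies on the bottom edge of the bounding
    box, shifted right by text_height * sin(angle).
    """
    if not 0 <= angle_ccw_deg <= 90:
        raise ValueError("this sketch only covers angles from 0 to 90 degrees")
    t = math.radians(angle_ccw_deg)
    return hpos + text_height * math.sin(t), vpos + height

# Hypothetical 100 x 20 word at 45 degrees: the bounding box is
# 120/sqrt(2) pixels on each side
side = 120 / math.sqrt(2)
print(baseline_start(300, 400, side, 20, 45))
```

A renderer could then draw the text at the returned point with the given angle. Other angle ranges put the starting corner on a different edge of the bounding box and would need their own cases.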