OCR-D / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

Label, Layers and Relation #4

Open bertsky opened 3 years ago

kba commented 3 years ago

8c18d4b implements mapping PAGE @type attributes to ALTO LayoutTag/@LABEL.

Layers: I cannot find any mechanism for expressing z-level in ALTO.

As for relations I also doubt it can be easily mapped, at least I don't see how :(

bertsky commented 3 years ago

8c18d4b implements mapping PAGE @type attributes to ALTO LayoutTag/@LABEL.

Excellent!

I thought just using ALTO's BlockType/@TYPE would be enough for PAGE's various regions' @type. But TagType looks better I must admit. Just a few comments:

  1. Why LayoutTag and not StructureTag?
  2. Perhaps one could do both kinds of mappings, a verbatim copy of @type as @TYPE and the elaborate tagging?
  3. How about including Page/@type vs Layout/Page/@PAGECLASS via the same mechanism?
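
To make point 2 concrete, here is a minimal sketch of doing both mappings at once. It is not the converter's actual code: the helper name `tag_block` and the `tag_N` ID scheme are made up for illustration, and whether @TYPE is schema-valid on every ALTO block type is not checked here.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def tag_block(tags, block, page_type):
    """Copy a PAGE region @type onto an ALTO block in both ways.

    `tags` is the ALTO <Tags> element, `block` an ALTO block element.
    Hypothetical helper, not part of page-to-alto.
    """
    # Reuse an existing LayoutTag with the same LABEL, otherwise create one.
    for tag in tags.findall(f"{{{ALTO_NS}}}LayoutTag"):
        if tag.get("LABEL") == page_type:
            break
    else:
        tag = ET.SubElement(
            tags,
            f"{{{ALTO_NS}}}LayoutTag",
            # ID scheme is an assumption; it only has to be unique.
            {"ID": f"tag_{len(tags)}", "LABEL": page_type},
        )
    # Elaborate mapping: reference the tag from the block.
    refs = block.get("TAGREFS", "").split()
    if tag.get("ID") not in refs:
        refs.append(tag.get("ID"))
    block.set("TAGREFS", " ".join(refs))
    # Verbatim mapping: copy @type as @TYPE.
    block.set("TYPE", page_type)
    return tag
```

Consumers that only look at @TYPE and consumers that resolve @TAGREFS would then both see the region type.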

What about PAGE's Label mechanism though? Looks as though it is somewhat equivalent to ALTO's TagsType and @TAGREFS... Perhaps via OtherTag?

Layers: I cannot find any mechanism for expressing z-level in ALTO.

IMHO you could express it as a StructureTag with @ID for @id and @LABEL for @zIndex – but I don't know if this is of any use/relevance for anyone.
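
A rough sketch of what I mean, assuming PAGE 2019 `Layer` elements that carry @id and @zIndex. The function name and the `TYPE="layer"` marker are my own inventions, not anything ALTO prescribes.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def layers_to_structure_tags(page_root, tags):
    """Emit one ALTO StructureTag per PAGE Layer (hypothetical helper)."""
    for layer in page_root.iter(f"{{{PAGE_NS}}}Layer"):
        ET.SubElement(
            tags,
            f"{{{ALTO_NS}}}StructureTag",
            {
                "ID": layer.get("id"),         # PAGE @id becomes ALTO @ID
                "TYPE": "layer",               # assumption: marks the tag as a layer
                "LABEL": layer.get("zIndex"),  # PAGE @zIndex becomes ALTO @LABEL
            },
        )
    return tags
```

The blocks belonging to a layer would still have to point at the tag via @TAGREFS; that part is left out here.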

As for relations I also doubt it can be easily mapped, at least I don't see how :(

From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.

kba commented 3 years ago

Why LayoutTag and not StructureTag?

I was unsure myself and let @cneud be the tiebreaker :) I don't really know the difference tbh.

Perhaps one could do both kinds of mappings, a verbatim copy of @type as @TYPE and the elaborate tagging?

I did not realize that ALTO has @TYPE. Being redundant here for implementations that use either mechanism makes sense.

How about including Page/@type vs Layout/Page/@PAGECLASS via the same mechanism?

:+1:

What about PAGE's Label mechanism though?

Sure, I can have a look. Do you have an example?

IMHO you could express it as a StructureTag with @ID for @id and @LABEL for @zIndex – but I don't know if this is of any use/relevance for anyone.

Sure, why not. Again, an example would help with testing.

From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.

IIUC the example cited is not a relation from drop-cap to region but just tagging that this alto:String is a DropCap with content A (which seems unnecessary). We could use a hack with @ID being the source and @LABEL or @DESCRIPTION being the target region. It would be better than losing that information for sure.
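
Roughly like this, as an untested idea rather than anything implemented. One caveat: the source region's ID cannot be reused verbatim as the tag's @ID, since the block already carries it, so the sketch derives a prefixed ID instead. The choice of `OtherTag`, the `rel_` prefix and the plain tuples as input are all assumptions.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def relations_to_tags(relations, tags):
    """Encode PAGE relations as ALTO tags (hypothetical hack).

    `relations` is an iterable of (relation_type, source_id, target_id)
    tuples, e.g. ("link", "r1", "r2"), already extracted from PAGE.
    """
    for rel_type, source, target in relations:
        ET.SubElement(
            tags,
            f"{{{ALTO_NS}}}OtherTag",
            {
                # Derived from the source region; prefixed to stay unique.
                "ID": f"rel_{source}_{target}",
                "TYPE": rel_type,   # "link" or "join"
                "LABEL": target,    # target region ID
            },
        )
    return tags
```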

bertsky commented 3 years ago

I did not realize that ALTO has @TYPE. Being redundant here for implementations that use either mechanism makes sense.

I concur.

What about PAGE's Label mechanism though?

Sure, I can have a look. Do you have an example?

Pass (again), sorry. I have grepped through all my PAGE-XML GT resources (which include various datasets from PRImA), but have not found anything on Labels or Relation. (But the latter is in some of the OCR-D structure GT IIRC.)

It's quite expressive: you can have Labels under MetadataItem, all segment hierarchy types from Page to Glyph, all ReadingOrder group types, and even Relation. We should probably open an issue and demand more documentation/examples.

From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.

IIUC the example cited is not a relation from drop-cap to region but just tagging that this alto:String is a DropCap with content A (which seems unnecessary).

You're right – it looked more promising at first glance.

So we do need a representation for link vs join. PAGE's schema-internal documentation reads as if this should apply on different hierarchy levels, but I cannot find a single GT example.

I would expect:

But with ALTO we already have an explicit white-space model – on the line level. So I guess you could argue keeping a SP after the final String could represent link (as opposed to join). But that would just be a convention, and I doubt anyone already uses it. Also, for the third case, we don't know how much use ALTO producers/consumers make of HYP and of String/@SUBS_TYPE (HypPart1 and HypPart2). And beyond that we still need to mark paragraph joins (non-breaks).

I was curious how TEI converters handle this. Sifting through with https://github.com/cneud/ocr-conversion and https://github.com/altoxml/documentation/wiki/Software

I cannot believe there is no existing ALTO-TEI converter capable of unwrapping lines and concatenating text into linear sequence (based on reading order and block/paragraph boundaries). :frowning:
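
For illustration, the kind of unwrapping I have in mind could be sketched as follows. This is a toy, not an existing converter: it assumes ALTO v4, takes document order as reading order, treats every TextBlock as one paragraph, and relies on HYP and String/@SUBS_TYPE with @SUBS_CONTENT actually being present. The function names are made up.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"
NS = {"a": ALTO_NS}


def unwrap_block(block):
    """Concatenate all lines of one TextBlock into a single string."""
    out = []
    skip_part2 = False  # True once HypPart1 already emitted the full word
    for line in block.findall("a:TextLine", NS):
        join_next = False  # True if this line ends in a hyphenation
        for el in line:
            name = el.tag.rsplit("}", 1)[-1]
            if name == "SP":
                out.append(" ")
            elif name == "HYP":
                join_next = True  # drop the hyphen, glue to next line
            elif name == "String":
                subs = el.get("SUBS_TYPE")
                full = el.get("SUBS_CONTENT")
                if subs == "HypPart1" and full:
                    out.append(full)
                    join_next = skip_part2 = True
                elif subs == "HypPart2" and skip_part2:
                    skip_part2 = False  # word was already emitted
                else:
                    out.append(el.get("CONTENT", ""))
        if not join_next:
            out.append(" ")  # ordinary line break becomes a space
    return "".join(out).strip()


def unwrap(alto_root):
    """One output line per TextBlock, in document order."""
    blocks = alto_root.iter(f"{{{ALTO_NS}}}TextBlock")
    return "\n".join(unwrap_block(b) for b in blocks)
```

Paragraph joins across blocks (the non-breaks mentioned above) are exactly what this cannot do without some extra marker.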

We could use a hack with @ID being the source and @LABEL or @DESCRIPTION being the target region. It would be better than losing that information for sure.

Not sure anymore we strictly need a relation type (see above: probably just a marker for "join-with-next" on various levels)...