Open bertsky opened 3 years ago
8c18d4b implements mapping PAGE @type attributes to ALTO LayoutTag/@LABEL.
Excellent!
I thought just using ALTO's BlockType/@TYPE would be enough for PAGE's various regions' @type. But TagType looks better, I must admit. Just a few comments:
Why LayoutTag and not StructureTag?
Perhaps one could do both kinds of mappings, a verbatim copy of @type as @TYPE and the elaborate tagging?
(How about including Page/@type vs Layout/Page/@PAGECLASS via the same mechanism?)
What about PAGE's Label mechanism though? Looks as though it is somewhat equivalent to ALTO's TagsType and @TAGREFS ... Perhaps via OtherTag?
Layers: I cannot find any mechanism for expressing z-level in ALTO.
IMHO you could express it as StructureTag with @ID for @id and @LABEL for @zIndex – but I don't know if this is of any use/relevance for anyone.
As for relations I also doubt it can be easily mapped, at least I don't see how :(
From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.
Why LayoutTag and not StructureTag?
I was unsure myself and let @cneud be the tiebreaker :) I don't really know the difference tbh.
Perhaps one could do both kinds of mappings, a verbatim copy of @type as @TYPE and the elaborate tagging?
I did not realize that ALTO has @TYPE. Being redundant here for implementations that use either mechanism makes sense.
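To make the "do both" idea concrete, here is a minimal sketch, not the actual implementation in 8c18d4b. It assumes namespace-free element names for brevity, a hypothetical helper `tag_block`, and a made-up ID scheme (`LAYOUT_0`, `LAYOUT_1`, ...). The block gets a verbatim @TYPE copy plus a shared LayoutTag referenced via @TAGREFS.

```python
import xml.etree.ElementTree as ET


def tag_block(alto_root, block, page_type):
    """Record a PAGE region @type on an ALTO block in two redundant ways."""
    # 1) Verbatim copy into the block's own @TYPE attribute.
    block.set("TYPE", page_type)

    # 2) Elaborate tagging: one shared LayoutTag per distinct type,
    #    referenced from the block via @TAGREFS.
    tags = alto_root.find("Tags")
    if tags is None:
        # ALTO expects Tags before Layout, so insert it in front of Layout.
        layout = alto_root.find("Layout")
        index = list(alto_root).index(layout) if layout is not None else len(alto_root)
        tags = ET.Element("Tags")
        alto_root.insert(index, tags)
    tag = next((t for t in tags.findall("LayoutTag")
                if t.get("LABEL") == page_type), None)
    if tag is None:
        # Hypothetical ID scheme; any document-unique ID would do.
        tag_id = f"LAYOUT_{len(tags.findall('LayoutTag'))}"
        tag = ET.SubElement(tags, "LayoutTag", ID=tag_id, LABEL=page_type)
    refs = block.get("TAGREFS", "").split()
    if tag.get("ID") not in refs:
        refs.append(tag.get("ID"))
    block.set("TAGREFS", " ".join(refs))
    return tag


# Two regions of the same PAGE type end up sharing a single LayoutTag.
alto = ET.Element("alto")
space = ET.SubElement(ET.SubElement(ET.SubElement(alto, "Layout"), "Page"), "PrintSpace")
first = ET.SubElement(space, "TextBlock", ID="r1")
second = ET.SubElement(space, "TextBlock", ID="r2")
tag_block(alto, first, "heading")
tag_block(alto, second, "heading")
print(ET.tostring(alto, encoding="unicode"))
```

Consumers that only look at @TYPE and consumers that resolve @TAGREFS would both find the type this way, at the cost of storing it twice.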
How about including Page/@type vs Layout/Page/@PAGECLASS via the same mechanism?
:+1:
What about PAGE's Label mechanism though?
Sure, I can have a look. Do you have an example?
IMHO you could express it as StructureTag with @ID for @id and @LABEL for @zIndex – but I don't know if this is of any use/relevance for anyone.
Sure, why not. Again, an example would help with testing.
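For testing, something like the following could serve as a starting point. It is only a sketch of the proposal above: the PAGE snippet is hand-written and hypothetical (no real GT), namespaces are omitted, and the @TAGREFS linking assumes that ALTO blocks reuse the PAGE region ids, which is not part of the proposal itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE fragment (namespace omitted).
PAGE_SNIPPET = """
<Page>
  <Layers>
    <Layer id="layer_bg" zIndex="0"><RegionRef regionRef="r1"/></Layer>
    <Layer id="layer_fg" zIndex="1"><RegionRef regionRef="r2"/></Layer>
  </Layers>
</Page>
"""


def layers_to_structure_tags(page, tags, blocks_by_id):
    """Emit one StructureTag per PAGE Layer: @ID from @id, @LABEL from @zIndex."""
    for layer in page.iter("Layer"):
        ET.SubElement(tags, "StructureTag",
                      ID=layer.get("id"), LABEL=layer.get("zIndex"))
        # Assumption: ALTO blocks reuse the PAGE region ids.
        for ref in layer.iter("RegionRef"):
            block = blocks_by_id.get(ref.get("regionRef"))
            if block is not None:
                refs = block.get("TAGREFS", "").split() + [layer.get("id")]
                block.set("TAGREFS", " ".join(refs))


page = ET.fromstring(PAGE_SNIPPET)
tags = ET.Element("Tags")
blocks = {rid: ET.Element("TextBlock", ID=rid) for rid in ("r1", "r2")}
layers_to_structure_tags(page, tags, blocks)
print(ET.tostring(tags, encoding="unicode"))
```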
From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.
IIUC the example cited is not a relation from drop-cap to region but just tagging that this alto:String is a DropCap with content A (which seems unnecessary). We could use a hack with @ID being the source and @LABEL or @DESCRIPTION being the target region. It would be better than losing that information for sure.
I did not realize that ALTO has @TYPE. Being redundant here for implementations that use either mechanism makes sense.
I concur.
What about PAGE's Label mechanism though?
Sure, I can have a look. Do you have an example?
Pass (again), sorry. I have grepped through all my PAGE-XML GT resources (which include various datasets from PRImA), but have not found anything on Labels or Relation. (But the latter is in some of OCR-D structure GT IIRC.)
It's quite expressive: you can have Labels under MetadataItem, all segment hierarchy types from Page to Glyph, all ReadingOrder group types, and even Relation. We should probably open an issue and demand more documentation/examples.
From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.
IIUC the example cited is not a relation from drop-cap to region but just tagging that this alto:String is a DropCap with content A (which seems unnecessary).
You're right – it looked more promising at the first glance.
So we do need a representation for link vs join. PAGE's schema-internal documentation reads as if this should apply on different hierarchy levels, but I cannot find a single GT example.
I would expect:
drop-capital vs paragraph:
Word itself is a whole word (i.e. is to be delimited by white space)
join (i.e. no extra line break or paragraph break)
paragraph vs paragraph:
Word of the first is continued in the second (i.e. is to be de-hyphenated)
Word of the first is continued on the second (i.e. is to be de-hyphenated)
But with ALTO we already have an explicit white-space model – on the line level. So I guess you could argue keeping a SP after the final String could represent link (as opposed to join). But that would just be a convention, and I doubt anyone already uses it. Also, for the third case, we don't know how much use ALTO producers/consumers make of HYP and of String/@SUBS_TYPE (HypPart1 and HypPart2). And beyond that we still need to mark paragraph joins (non-breaks).
I was curious how TEI converters handle this. Sifting through with https://github.com/cneud/ocr-conversion and https://github.com/altoxml/documentation/wiki/Software …
@SUBS_TYPE=HypPart1 with @SUBS_CONTENT
I cannot believe there is no existing ALTO-TEI converter capable of unwrapping lines and concatenating text into a linear sequence (based on reading order and block/paragraph boundaries). :frowning:
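As a rough illustration of what such unwrapping could look like for a single block, here is a minimal sketch. It assumes namespace-free ALTO element names, that producers actually fill String/@SUBS_TYPE and @SUBS_CONTENT, and a hypothetical function name `unwrap_block`; it does not handle reading order across blocks or paragraph joins.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment (namespace omitted).
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="An"/><SP/>
    <String CONTENT="exam" SUBS_TYPE="HypPart1" SUBS_CONTENT="example"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="ple" SUBS_TYPE="HypPart2" SUBS_CONTENT="example"/><SP/>
    <String CONTENT="text"/>
  </TextLine>
</TextBlock>
"""


def unwrap_block(block):
    """Concatenate the text of one ALTO TextBlock into a single line."""
    parts = []
    pending_hyphenation = False  # True after a HypPart1 whose full word was emitted
    for line in block.iter("TextLine"):
        for item in line:
            if item.tag == "SP":
                parts.append(" ")
            elif item.tag == "String":
                subs_type = item.get("SUBS_TYPE")
                full_word = item.get("SUBS_CONTENT")
                if subs_type == "HypPart1" and full_word:
                    parts.append(full_word)
                    pending_hyphenation = True
                elif subs_type == "HypPart2" and pending_hyphenation:
                    pending_hyphenation = False  # already emitted with part 1
                else:
                    # Without SUBS_CONTENT the two halves stay separate words.
                    parts.append(item.get("CONTENT", ""))
            # HYP elements are skipped: the hyphen is not part of the word.
        if not pending_hyphenation:
            parts.append(" ")  # a line break counts as white space
    return " ".join("".join(parts).split())


print(unwrap_block(ET.fromstring(SAMPLE)))  # An example text
```

Relying on @SUBS_CONTENT avoids guessing whether a trailing hyphen is a real hyphen, but it only works as far as producers write those attributes.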
We could use a hack with @ID being the source and @LABEL or @DESCRIPTION being the target region. It would be better than losing that information for sure.
Not sure anymore we strictly need a relation type (see above: probably just a marker for "join-with-next" on various levels)...