Open tboenig opened 2 years ago
- the levels of ground truth
note: there is already a top-level processingLevel, but I don't understand it TBH (and it's not used AFAICS)
- antiqua font with or without "ſ", with "ß" or "ſs" or "ss" (e.g. https://www.deutschestextarchiv.de/book/view/goethe_metamorphose_1790?p=10)
perhaps other systematic conventions in the writing system, like whether or not ꝛ is present, umlauts (see below) etc.
- legal/technical/commercial/linguistic/... Special characters
I don't know if these fit, but there are already domains under topic/...
- letter spaced and/or bold print
suggestion:
data-attributes/document-related/visual/text/font/bold-face
data-attributes/document-related/visual/text/font/letter-spaced
- umlaut äöü and/or aͤoͤuͤ
question: is it relevant to single out the rare case where both are present?
- Ancient Greek (Greek diacritics...)
esp. whether polytonic or monotonic or old/pre-standard
All these character set-specific distinctions probably don't fit well into a flat tagging system...
Can you add something?
Perhaps under dataTransformation/...
one should have tags for
(I would not subsume these under enhancement.)
Thanks, @bertsky.
The GT labeling metadata complements the metadata that is already present in each PAGE file. The requirement is that the GT is in PAGE format, which provides for a large amount of metadata. In particular, specific metadata can be assigned via TextStyle: https://ocr-d.de/en/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextRegionType.html#TextRegionType_TextStyle
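For illustration, here is a minimal sketch of how the TextStyle metadata could be read back out of a PAGE file with Python's standard library. The sample document, the region ids and the helper name region_styles are made up for this example, and the namespace assumes the 2019-07-15 PAGE schema version; bold and letterSpaced are TextStyle attributes in that schema.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; adjust to the schema version of your files.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up sample, only for demonstration (not validated against the schema).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <Coords points="0,0 10,0 10,10 0,10"/>
      <TextStyle fontFamily="Antiqua" bold="true"/>
    </TextRegion>
    <TextRegion id="r2">
      <Coords points="0,20 10,20 10,30 0,30"/>
      <TextStyle letterSpaced="true"/>
    </TextRegion>
    <TextRegion id="r3">
      <Coords points="0,40 10,40 10,50 0,50"/>
    </TextRegion>
  </Page>
</PcGts>"""


def region_styles(page_xml):
    """Map each TextRegion id to its bold / letterSpaced flags (hypothetical helper)."""
    root = ET.fromstring(page_xml.encode("utf-8"))
    styles = {}
    for region in root.iterfind(".//pc:TextRegion", NS):
        style = region.find("pc:TextStyle", NS)
        attrs = style.attrib if style is not None else {}
        # xs:boolean allows both "true" and "1"; a missing attribute counts as False here.
        styles[region.get("id")] = {
            "bold": attrs.get("bold") in ("true", "1"),
            "letterSpaced": attrs.get("letterSpaced") in ("true", "1"),
        }
    return styles


if __name__ == "__main__":
    for region_id, flags in region_styles(SAMPLE).items():
        print(region_id, flags)
```

Regions without a TextStyle element are reported with both flags off, which is a simplification: a missing element says nothing about the actual print.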
to 1.
to 2. The labeling only distinguishes between Antiqua and Blackletter. Typographic forms and spellings should be governed by the levels. For example, for a text in Antiqua: if the text is without long s, the labeling is level1, antiqua; if the text is with long s, the labeling is level2, antiqua.
to 3. legal/technical/commercial/linguistic/... Special characters
to 4. Letter-spaced and/or bold print: this is to be implemented in PAGE. See https://ocr-d.de/en/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextRegionType.html#TextRegionType_TextStyle
to 5. Umlauts: umlauts as äöü and/or aͤoͤuͤ
to 6. Special orthographies and changes of spellings which were/are officially enacted (e.g. spelling reform in German, monotonic orthography,...).
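As a rough sketch of how the character-level distinctions from points 2 and 5 could be checked on a transcription, the following Python snippet looks for long s, r rotunda and the two umlaut forms. The function names and the returned keys are hypothetical, and antiqua_level only mirrors the single Antiqua example given under point 2; it is not a full definition of the levels.

```python
import unicodedata

LONG_S = "\u017f"             # ſ
R_ROTUNDA = "\ua75b"          # ꝛ
COMBINING_SMALL_E = "\u0364"  # superscript e as in aͤ oͤ uͤ
# Precomposed ä ö ü Ä Ö Ü, written as escapes so the file's own encoding does not matter.
MODERN_UMLAUTS = set("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc")


def writing_conventions(text):
    """Report which conventions occur in a transcription (hypothetical helper)."""
    # NFC composes base letter + U+0308 into ä/ö/ü; a + U+0364 has no precomposed form.
    nfc = unicodedata.normalize("NFC", text)
    return {
        "long_s": LONG_S in nfc,
        "r_rotunda": R_ROTUNDA in nfc,
        "umlaut_modern": any(ch in MODERN_UMLAUTS for ch in nfc),
        "umlaut_superscript_e": COMBINING_SMALL_E in nfc,
    }


def antiqua_level(text):
    """Only the example from point 2: Antiqua with long s -> level2, without -> level1."""
    return "level2" if LONG_S in text else "level1"


if __name__ == "__main__":
    sample = "Wa\u017f\u017fer u\u0364ber"
    print(writing_conventions(sample))
    print(antiqua_level(sample))
```

A text for which both umlaut_modern and umlaut_superscript_e come back true would be the rare mixed case asked about above.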
@bertsky: I am missing some metadata for the following cases:
Can you add something? @bertsky