OCR-D / gt-labelling


missing metadata: e.g. for leveling, antiqua, special characters, letter spaced, umlaut, old greek #4

Open tboenig opened 2 years ago

tboenig commented 2 years ago

@bertsky: I am missing some metadata for the following cases:

Can you add something? @bertsky

bertsky commented 2 years ago
  • the levels of ground truth

note: there is already a top-level processingLevel, but I don't understand it TBH (and it's not used AFAICS)

perhaps other systematic conventions in the writing system, like whether or not long s (ſ) is present, umlauts (see below), etc.

  • legal/technical/commercial/linguistic/... Special characters

I don't know if these fit, but there are already domains under topic/...

  • letter spaced and/or bold print

suggestion:

  • umlaut äöü and/or aͤoͤuͤ

question: is it relevant to single out the rare case where both are present?

  • old greek (greek diacritics...)

esp. whether polytonic or monotonic or old/pre-standard

All these character set-specific distinctions probably don't fit well into a flat tagging system...

Can you add something?

Perhaps under dataTransformation/... one should have tags for

(I would not subsume these under enhancement.)

tboenig commented 2 years ago

Thanks, @bertsky

The GT labeling metadata complements the metadata that is already present in each PAGE file. The requirement is that the GT is in PAGE format. The PAGE format provides for a large amount of metadata. In particular, specific metadata can be assigned in the TextStyle elements. TextStyle: https://ocr-d.de/en/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextRegionType.html#TextRegionType_TextStyle

to 1.

to 2. In the labeling, a distinction is made only between Antiqua and Blackletter. Typographic forms or spellings should be covered by the levels. For example, for a text in Antiqua: if the text is without long s, the labeling is level1, antiqua; if the text is with long s, the labeling is level2, antiqua.
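As an illustration of the example above only, here is a minimal sketch that derives the two labels from the presence of long s. The function name `antiqua_level_label` is hypothetical and not part of gt-labelling, and the actual level definitions may depend on more criteria than this single character.

```python
# Hypothetical helper: mirrors only the long-s example given above.
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S (ſ)


def antiqua_level_label(text):
    """Return the label pair for an Antiqua text, based on long s alone."""
    level = "level2" if LONG_S in text else "level1"
    return [level, "antiqua"]


if __name__ == "__main__":
    print(antiqua_level_label("Wasser"))
    print(antiqua_level_label("Wa\u017f\u017fer"))
```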

to 3. legal/technical/commercial/linguistic/... Special characters

to 4. Letter-spaced and/or bold print: this is to be recorded in PAGE itself. See https://ocr-d.de/en/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextRegionType.html#TextRegionType_TextStyle
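For reference, a rough sketch of how the `letterSpaced` and `bold` attributes of `TextStyle` could be read back from a PAGE file with the Python standard library. The sample document is an assumption made up for illustration (it is simplified and omits elements such as `Coords` that a valid PAGE file needs), and `text_style_flags` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up sample; a real PAGE file contains Coords, TextLine, etc.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextStyle letterSpaced="true" bold="false"/>
    </TextRegion>
    <TextRegion id="r2">
      <TextStyle bold="true"/>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def text_style_flags(page_xml, flags=("letterSpaced", "bold")):
    """Map each TextRegion id to the boolean TextStyle flags set on it."""
    root = ET.fromstring(page_xml)
    result = {}
    for region in root.iter():
        if local_name(region.tag) != "TextRegion":
            continue
        for child in region:
            if local_name(child.tag) == "TextStyle":
                # A missing attribute is treated as False here.
                result[region.get("id")] = {f: child.get(f) == "true" for f in flags}
    return result


if __name__ == "__main__":
    print(text_style_flags(SAMPLE))
```

Only region-level styles are inspected here; `TextStyle` on lines, words or glyphs would need the same treatment.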

to 5. Umlauts: umlauts as äöü and/or aͤoͤuͤ
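To make the two umlaut conventions concrete, here is a minimal sketch that reports which of them occur in a transcription, including the rare case where both are present. It assumes that aͤoͤuͤ is encoded with U+0364 (COMBINING LATIN SMALL LETTER E) and äöü with precomposed characters or U+0308; the function name and the style names `diaeresis` and `small-e` are made up for illustration.

```python
import unicodedata

DIAERESIS = "\u0308"      # combining diaeresis, as in decomposed ä ö ü
SMALL_E_ABOVE = "\u0364"  # combining Latin small letter e, as in aͤ oͤ uͤ
UMLAUT_BASES = set("aouAOU")


def umlaut_styles(text):
    """Return the set of umlaut conventions found in the text."""
    styles = set()
    prev_base = ""
    # NFD splits precomposed ä/ö/ü into base letter + combining diaeresis.
    for ch in unicodedata.normalize("NFD", text):
        if not unicodedata.combining(ch):
            prev_base = ch
        elif prev_base in UMLAUT_BASES:
            if ch == DIAERESIS:
                styles.add("diaeresis")
            elif ch == SMALL_E_ABOVE:
                styles.add("small-e")
    return styles


if __name__ == "__main__":
    print(umlaut_styles("B\u00fccher und Bu\u0364cher"))
```

A text for which both names are returned is the mixed case asked about earlier in the thread.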

to 6. Special orthographies and changes of spellings which were/are officially enacted (e.g. spelling reform in German, monotonic orthography,...).
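For the Greek case, a rough heuristic sketch of how polytonic and monotonic text could be told apart automatically. It assumes that the presence of breathings, grave, perispomeni or iota subscript on a Greek letter indicates polytonic spelling; it cannot recognize old or pre-standard orthographies, and the function name `classify_greek` and its return values are hypothetical.

```python
import unicodedata

# Combining marks that occur in polytonic but not in monotonic Greek:
# grave, psili, dasia, perispomeni, ypogegrammeni (iota subscript).
POLYTONIC_MARKS = {"\u0300", "\u0313", "\u0314", "\u0342", "\u0345"}


def classify_greek(text):
    """Return 'polytonic', 'monotonic', or 'none' (no Greek letters found)."""
    has_greek = False
    base_is_greek = False
    # NFD decomposes Greek Extended characters into base letter + marks.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if base_is_greek and ch in POLYTONIC_MARKS:
                return "polytonic"
        else:
            base_is_greek = "\u0370" <= ch <= "\u03ff"
            has_greek = has_greek or base_is_greek
    return "monotonic" if has_greek else "none"


if __name__ == "__main__":
    print(classify_greek("ἐν ἀρχῇ ἦν ὁ λόγος"))
    print(classify_greek("Καλημέρα κόσμε"))
```

Marks are only counted when they follow a Greek base letter, so a grave accent in, say, French text on the same page does not trigger the polytonic result.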