UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

Discussion: Transcription of decimal dot in numbers #9

Open wollmers opened 4 years ago

wollmers commented 4 years ago

In some of the original images, the decimal dot in numbers does not sit near the baseline. It sits either at the height of the hyphen or at the top edge (height of capitals).

IMHO, for broader use of the GT files (OCR training, benchmarking), an intermediate transcription should be used, i.e. Unicode without PUA and as close as possible to the original glyphs (long s), spelling, etc. Converting to the basic level (current spelling, German keyboard) is easier than converting in the other direction.
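To illustrate why the downward conversion is the easy direction, here is a minimal Python sketch. The mapping, the function name `to_basic`, and the sample line are made up for illustration; they are not part of the dataset tooling or an agreed convention.

```python
# Hypothetical mapping from an intermediate transcription to the basic level.
# The selection of characters is illustrative only.
TO_BASIC = str.maketrans({
    "\u017F": "s",   # LATIN SMALL LETTER LONG S -> s
    "\u00B7": ".",   # MIDDLE DOT -> FULL STOP
    "\u02D9": ".",   # DOT ABOVE -> FULL STOP
})

def to_basic(line: str) -> str:
    """Fold intermediate-level glyphs into basic-level characters."""
    return line.translate(TO_BASIC)

# Invented sample line with a raised decimal dot and two long s.
intermediate = "Preis 3\u00B750 fl., Wa\u017F\u017Fer"
basic = to_basic(intermediate)
print(basic)  # Preis 3.50 fl., Wasser
```

The mapping is many-to-one: both raised dots and the baseline dot end up as FULL STOP, so the basic-level text cannot be mapped back to the original glyphs without looking at the image again.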

Which dots are available in Unicode:

     cpoint  name
'.'  U+002E  FULL STOP (Other_Punctuation)
'·'  U+00B7  MIDDLE DOT (Other_Punctuation)

'˙'  U+02D9  DOT ABOVE (Modifier_Symbol)
'·'  U+0387  GREEK ANO TELEIA (Other_Punctuation)
'᛫'  U+16EB  RUNIC SINGLE PUNCTUATION (Other_Punctuation)
'․'  U+2024  ONE DOT LEADER (Other_Punctuation)
'‧'  U+2027  HYPHENATION POINT (Other_Punctuation)
'∙'  U+2219  BULLET OPERATOR (Math_Symbol)
'⋅'  U+22C5  DOT OPERATOR (Math_Symbol)
'⸱'  U+2E31  WORD SEPARATOR MIDDLE DOT (Other_Punctuation)
'⸳'  U+2E33  RAISED DOT (Other_Punctuation)
'・' U+30FB  KATAKANA MIDDLE DOT (Other_Punctuation)
'ꞏ'  U+A78F  LATIN LETTER SINOLOGICAL DOT (Other_Letter)
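The names and categories in this list can be checked with Python's standard `unicodedata` module. This is only a sketch: `DOTS`, `LONG_NAMES` and `describe` are my own helper names, and `LONG_NAMES` covers just the four categories that occur above.

```python
import unicodedata

# Code points of the dot candidates listed above.
DOTS = [0x002E, 0x00B7, 0x02D9, 0x0387, 0x16EB, 0x2024, 0x2027,
        0x2219, 0x22C5, 0x2E31, 0x2E33, 0x30FB, 0xA78F]

# unicodedata.category() returns two-letter abbreviations;
# map the ones needed here to their long names.
LONG_NAMES = {"Po": "Other_Punctuation", "Sk": "Modifier_Symbol",
              "Sm": "Math_Symbol", "Lo": "Other_Letter"}

def describe(cp: int) -> str:
    """Return 'U+XXXX  NAME (Category)' for one code point."""
    ch = chr(cp)
    cat = unicodedata.category(ch)
    return f"U+{cp:04X}  {unicodedata.name(ch)} ({LONG_NAMES.get(cat, cat)})"

for cp in DOTS:
    print(describe(cp))
```

One thing to keep in mind when choosing: U+0387 GREEK ANO TELEIA canonically decomposes to U+00B7 MIDDLE DOT, so it does not survive NFC normalization (`unicodedata.normalize("NFC", "\u0387")` yields U+00B7).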

MIDDLE DOT appears frequently in current and old typography and is available in most fonts.

However, I hesitate to use DOT ABOVE because it is a modifier symbol, not punctuation. We can use it for now and convert later if needed, after gathering some opinions.
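A later conversion should be mechanical as long as DOT ABOVE is only used as a decimal mark between digits. A minimal sketch under that assumption; the function name and the default target (MIDDLE DOT) are hypothetical choices, not something decided in this thread:

```python
import re

# DOT ABOVE directly between two digits, e.g. "3˙50".
DECIMAL_DOT_ABOVE = re.compile(r"(?<=\d)\u02D9(?=\d)")

def replace_decimal_dot_above(line: str, target: str = "\u00B7") -> str:
    """Replace DOT ABOVE used as a decimal mark with `target`."""
    return DECIMAL_DOT_ABOVE.sub(target, line)

converted = replace_decimal_dot_above("3\u02D950")
```

Restricting the pattern to digit context leaves any other occurrence of U+02D9 untouched, so those could be reviewed by hand.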