wollmers closed 3 years ago
At the moment all Fraktur J before a vowel are transcribed as I in the text files.
Where did you find such patterns? I only found a few "Ia".
AFAIK GT4Hist keeps J.
Yes, the original GT4HistOCR normalized all uppercase I to J. That not only looks strange for Roman numerals, but is also wrong for some texts which don't use Fraktur fonts.
Therefore I started to restore uppercase I and also added it for Fraktur texts, even when there was no visible difference to J.
The resulting models then normally recognize I and J, so a full text search can look for "Juli" and "Indien".
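To check how a set of ground-truth lines currently handles this, one could count word-initial I and J by the class of the following letter. This is only a minimal sketch: the function name `classify_initials` and the vowel set are my assumptions, not part of GT4HistOCR or any existing tool, and Roman numerals such as "Ia" will show up in the I-before-vowel bucket.

```python
import re
from collections import Counter

# Assumed vowel set for German text; adjust for other languages.
VOWELS = set("aeiouäöü")

# Word-initial uppercase I or J followed by one lowercase letter.
INITIAL = re.compile(r"\b([IJ])([a-zäöüß])")


def classify_initials(lines):
    """Count word-initial I/J by whether a vowel or consonant follows."""
    counts = Counter()
    for line in lines:
        for match in INITIAL.finditer(line):
            letter, following = match.group(1), match.group(2)
            kind = "vowel" if following in VOWELS else "consonant"
            counts[(letter, kind)] += 1
    return counts
```

A large count for `("J", "consonant")` would point to text that still follows the old "everything is J" normalization, while `("I", "vowel")` entries are worth a manual look.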
At the moment all Fraktur J before a vowel are transcribed as I in the text files.
Where did you find such patterns? I only found a few "Ia".
Sorry, I meant consonants. The "Ia" are correct and stand for "1a" with a Roman numeral.
AFAIK GT4Hist keeps J.
Yes, the original GT4HistOCR normalized all uppercase I to J. That not only looks strange for Roman numerals, but is also wrong for some texts which don't use Fraktur fonts.
AFAIK J or j isn't used in Roman numerals in this corpus. That was sometimes usual before 1700.
Therefore I started to restore uppercase I and also added it for Fraktur texts, even when there was no visible difference to J.
The resulting models then normally recognize I and J, so a full text search can look for "Juli" and "Indien".
OK, then I keep them like this:
AFAIK J or j isn't used in Roman numerals in this corpus.
It was heavily used in 19th century texts from DTA, but there is also an example from 1588. See https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/-/commit/fbe5f5c127b3dbd3bfef2924c643d5a46bb8e725
OK, DTA is wrong (got a long list):
GT4HistOCR/corpus/dta19/1854-candidus_christus/00104.gt.txt:— VJJJ —
2 errors: no em-dash in the png, and no JJJ.
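A list like this could be produced with a small script that flags uppercase tokens which look like Roman numerals but contain J. This is a heuristic sketch under my own assumptions: the function name `find_j_numerals` is hypothetical, the J to I replacement is only a candidate correction that still has to be compared with the image, and abbreviations made of the same letters will be false positives.

```python
import re

# Tokens of at least two letters made only of Roman numeral letters
# plus J, and containing at least one J (checked by the lookahead).
J_NUMERAL = re.compile(r"\b(?=[IVXLCDM]*J)[IVXLCDMJ]{2,}\b")


def find_j_numerals(line):
    """Return (token, candidate) pairs for numeral-like tokens with J."""
    return [
        (match.group(0), match.group(0).replace("J", "I"))
        for match in J_NUMERAL.finditer(line)
    ]


def scan(lines_by_file):
    """Map each file name to its suspicious tokens; skips clean files."""
    report = {}
    for name, line in lines_by_file.items():
        hits = find_j_numerals(line)
        if hits:
            report[name] = hits
    return report
```

`scan` takes a dict of file name to line content, so it can be fed from whatever loop reads the `.gt.txt` files.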
DTA also has some books by Blumenbach in its corpus, which came from Blumenbach-Online and were transcribed with round s and the current alphabet, but old orthography. They are not in the GT4HistOCR selection, though. In any case I need to build my own GT data to cover the domain of natural history 1750-1900.
I mean something like this with jjj (1481 by William Caxton):
At the moment all Fraktur J before a vowel are transcribed as I in the text files.
I am not sure if it harms the quality of training. Maybe not, because training takes the adjacent characters into account.
On the other hand, I myself prefer that a J in the image stays a J in the result. Most Blackletter fonts did not have an I. After ~1900 they began to cut an I in some (~25 %) Blackletter fonts. The difference can only be seen if both appear in the same text.
AFAIK GT4Hist keeps J. It can be a problem if GT4Hist is combined with AustrianNewspapers.
Quick proof: