OCR-D / ocrd_keraslm

Simple character-based language model using keras
Apache License 2.0

Hyphenated words #22

Open jbarth-ubhd opened 7 months ago

jbarth-ubhd commented 7 months ago

Dear reader, does keraslm-rate take hyphenated words into account?

Using this demo file https://digi.ub.uni-heidelberg.de/diglitData/v/keraslm/test-fouche10,5-s1.pdf

It seems that many of the low-rated words are hyphenated:

With hyphenation:

# median: 0.962098 0.622701 ; mean: 0.948695 0.625144, correlation: 0.315179
# OCR-D-OCR OCR-D-KERAS
0.693236 0.410939  # region0002_line0021_word0003 daf3
0.927003 0.468318  # region0002_line0029_word0006 Rä-
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.931271 0.489822  # region0002_line0000_word0005 pas-
0.928169 0.491138  # region0000_line0004_word0007 sozia-
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494978  # region0003_line0001_word0005 Lyon,
0.926153 0.495819  # region0003_line0000_word0004 Kon-
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496326  # region0002_line0001_word0004 Rousseaus
0.967390 0.496934  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498529  # region0002_line0017_word0006 Lyon
0.910209 0.499826  # region0002_line0018_word0002 Instinktiv
...

Without hyphenation (manually removed):

# median: 0.962198 0.623943 ; mean: 0.949162 0.628181, correlation: 0.278264
# OCR-D-OCRNOHYP OCR-D-KERNOHYP
0.693236 0.411037  # region0002_line0021_word0003 daf3
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494945  # region0003_line0001_word0005 Lyon,
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496306  # region0002_line0001_word0004 Rousseaus
0.967390 0.496923  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498542  # region0002_line0017_word0006 Lyon
0.910209 0.499822  # region0002_line0018_word0002 Instinktiv
...
jbarth-ubhd commented 7 months ago
keras.csv     :0.927003 0.468318  # region0002_line0029_word0006 Rä-
kerasNOHYP.csv:0.927003 0.573203  # region0002_line0029_word0006 Rädelsführer
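
For context on the summary lines above (this is only a reading of the listing format, not part of the ocrd_keraslm output spec): column 1 is presumably the OCR confidence, column 2 the keraslm rating, and the header reports per-column median and mean plus their Pearson correlation. A minimal sketch to recompute those figures, assuming the layout shown and the keras.csv file mentioned above:

import numpy as np

# Sketch (assumed layout, matching the listings above): recompute the
# per-column median/mean and the Pearson correlation between the OCR
# confidence (column 1) and the keraslm rating (column 2).
pairs = []
with open('keras.csv', encoding='utf-8') as f:
    for line in f:
        if line.startswith('#') or not line.strip():
            continue  # skip header/summary comment lines
        ocr_conf, lm_rating = line.split()[:2]
        pairs.append((float(ocr_conf), float(lm_rating)))

ocr, lm = np.array(pairs).T
print('# median: %f %f ; mean: %f %f, correlation: %f' % (
    np.median(ocr), np.median(lm), np.mean(ocr), np.mean(lm),
    np.corrcoef(ocr, lm)[0, 1]))
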
bertsky commented 7 months ago

No, hyphens are not treated specially in any way. If you are using model_dta_full.h5, that LM was trained on the Deutsches Textarchiv Kernkorpus und Ergänzungstexte plaintext edition, which does contain the original type alignment (Zeilenfall, i.e. line breaks), so the model has "seen" hyphens and newlines. However, these texts are very diverse – some contain almost no hyphens, while others make heavy use of them. So I am not sure how well the model really learned to abstract over line breaks as a general possibility.

I have not specifically and methodically measured the impact of hyphenation myself, as you have. So thanks for the analysis!

In light of this, perhaps the model should indeed be applied with a dedicated rule: if there's a hyphen-like character at the end of a line, then strip it together with the line break (i.e. dehyphenate) before rating.
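
On plain text, such a rule could look like the following sketch. This is illustrative Python only, not the actual ocrd_keraslm code, and the set of hyphen-like characters is an assumption:

import re

# Sketch of the proposed inference-time rule on plain text: if a line ends
# in a hyphen-like character, drop the hyphen and the newline so that both
# word parts are rated as one token.
HYPHENS = '-\u00ad\u2010\u2011\u2e17'  # hyphen-minus, soft hyphen, Unicode hyphens, double oblique hyphen

def dehyphenate(text):
    return re.sub('[%s]\n' % HYPHENS, '', text)

assert dehyphenate('Rä-\ndelsführer in Lyon') == 'Rädelsführer in Lyon'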

bertsky commented 7 months ago

...and if inference mode does it this way, then training should explicitly mask all hyphens on the input side, too.

I am not even sure whether I should retain line breaks (newline character) as such.
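
For illustration, masking on the training side could amount to a preprocessing step like the following. This is an assumed sketch, not the current training code, with the newline question left as an option:

import re

# Assumed sketch of masking hyphenation on the training input side:
# remove line-end hyphen plus newline, and optionally flatten the
# remaining newlines to spaces.
def mask_hyphenation(text, keep_newlines=True):
    text = re.sub('[-\u00ad\u2e17]\n', '', text)  # drop line-end hyphenations
    if not keep_newlines:
        text = text.replace('\n', ' ')  # treat remaining line breaks as spaces
    return text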

bertsky commented 7 months ago

Note: meanwhile, I found out that the plaintext version of the DTA produced via dta-tools tei2txt has more problems for our use case:

Moreover, there are problems with the textual quality of the DTA extended set (Ergänzungstexte):

So I decided to do my own plaintext export from the TEI version, which solves all that – and takes another bold step: it now uses Unicode NFKD normalization, because precomposed characters are much harder to learn (being sparse), especially with such productive combinations as in polytonic Greek.
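
As a quick illustration of the effect (standard Python, not project code): NFKD splits a precomposed character into a base letter plus combining marks, so the model sees a few frequent codepoints instead of one rare one.

import unicodedata

# NFKD decomposes precomposed characters into base letter + combining marks;
# rare precomposed codepoints become sequences of frequent components.
for char in ('ä', 'ᾄ'):  # a with umlaut; alpha with psili, oxia and ypogegrammeni
    decomposed = unicodedata.normalize('NFKD', char)
    print('%s -> %s' % (char, ' '.join('U+%04X' % ord(c) for c in decomposed)))
# ä -> U+0061 U+0308
# ᾄ -> U+03B1 U+0313 U+0301 U+0345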

I will make some further code modifications to the training procedure (coverage of an explicit gap codepoint in the input, a cutoff frequency for implicit gaps in the input) and to the inference side (removal of normal line breaks and dehyphenation, applying NFKD and other string normalization, persisting the string preprocessor to the model config), and then retrain the DTA model.
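
Persisting the preprocessor could, for example, mean storing its settings next to the trained model so that inference reapplies exactly the same steps. The following is only a hypothetical sketch; the file name, keys and values are assumptions, not the actual model config format:

import json

# Hypothetical sketch: store the string-preprocessor settings next to the
# trained model so inference can reapply the same normalization.
preprocessor = {
    'unicode_normalization': 'NFKD',
    'dehyphenate': True,       # join line-end hyphenations
    'keep_newlines': False,    # whether to retain line breaks as such
    'gap_codepoint': '\u2423', # assumed placeholder codepoint for explicit gaps
}

with open('model_dta_full.preprocessor.json', 'w', encoding='utf-8') as f:
    json.dump(preprocessor, f, ensure_ascii=False, indent=2)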

Until then, the issue will stay open. If you have additional ideas, please comment.