OCR-D / ocrd_keraslm

Simple character-based language model using keras
Apache License 2.0

Hyphenated words #22

Open jbarth-ubhd opened 7 months ago

jbarth-ubhd commented 7 months ago

Dear reader, does keraslm-rate take hyphenated words into account?

Using this demo file https://digi.ub.uni-heidelberg.de/diglitData/v/keraslm/test-fouche10,5-s1.pdf

It seems that many of the low-rated words are hyphenated:

With hyphenation:

# median: 0.962098 0.622701 ; mean: 0.948695 0.625144, correlation: 0.315179
# OCR-D-OCR OCR-D-KERAS
0.693236 0.410939  # region0002_line0021_word0003 daf3
0.927003 0.468318  # region0002_line0029_word0006 Rä-
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.931271 0.489822  # region0002_line0000_word0005 pas-
0.928169 0.491138  # region0000_line0004_word0007 sozia-
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494978  # region0003_line0001_word0005 Lyon,
0.926153 0.495819  # region0003_line0000_word0004 Kon-
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496326  # region0002_line0001_word0004 Rousseaus
0.967390 0.496934  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498529  # region0002_line0017_word0006 Lyon
0.910209 0.499826  # region0002_line0018_word0002 Instinktiv
...

Without hyphenation (manually removed):

# median: 0.962198 0.623943 ; mean: 0.949162 0.628181, correlation: 0.278264
# OCR-D-OCRNOHYP OCR-D-KERNOHYP
0.693236 0.411037  # region0002_line0021_word0003 daf3
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494945  # region0003_line0001_word0005 Lyon,
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496306  # region0002_line0001_word0004 Rousseaus
0.967390 0.496923  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498542  # region0002_line0017_word0006 Lyon
0.910209 0.499822  # region0002_line0018_word0002 Instinktiv
...
jbarth-ubhd commented 7 months ago
keras.csv     :0.927003 0.468318  # region0002_line0029_word0006 Rä-
kerasNOHYP.csv:0.927003 0.573203  # region0002_line0029_word0006 Rädelsführer
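
For context on the summary lines above (this is only a reading of the listing format, not part of the ocrd_keraslm output spec): column 1 is presumably the OCR confidence, column 2 the keraslm rating, and the header reports per-column median and mean plus their Pearson correlation. A minimal sketch to recompute those figures, assuming the layout shown and the keras.csv file mentioned above:

import numpy as np

# Sketch (assumed layout, matching the listings above): recompute the
# per-column median/mean and the Pearson correlation between the OCR
# confidence (column 1) and the keraslm rating (column 2).
pairs = []
with open('keras.csv', encoding='utf-8') as f:
    for line in f:
        if line.startswith('#') or not line.strip():
            continue  # skip header/summary comment lines
        ocr_conf, lm_rating = line.split()[:2]
        pairs.append((float(ocr_conf), float(lm_rating)))

ocr, lm = np.array(pairs).T
print('# median: %f %f ; mean: %f %f, correlation: %f' % (
    np.median(ocr), np.median(lm), np.mean(ocr), np.mean(lm),
    np.corrcoef(ocr, lm)[0, 1]))
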
bertsky commented 7 months ago

No, hyphens are not treated specially in any way. If you are using model_dta_full.h5, that LM was trained on the Deutsches Textarchiv Kernkorpus und Ergänzungstexte plaintext edition, which does contain the original type alignment (Zeilenfall, i.e. line breaks), so the model has "seen" hyphens and newlines. However, these texts are very diverse – some contain almost no hyphens, while others make heavy use of them. So I am not sure how well the model really learned to abstract over line breaks as a general possibility.

I have not specifically and methodically measured the impact of hyphenation myself, as you have. So thanks for the analysis!

In light of this, perhaps the model should indeed be applied with a dedicated rule: if there's a hyphen-like character at the end of a line, then strip it together with the line break (i.e. dehyphenate) before rating.
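
On plain text, such a rule could look like the following sketch. This is illustrative Python only, not the actual ocrd_keraslm code, and the set of hyphen-like characters is an assumption:

import re

# Sketch of the proposed inference-time rule on plain text: if a line ends
# in a hyphen-like character, drop the hyphen and the newline so that both
# word parts are rated as one token.
HYPHENS = '-\u00ad\u2010\u2011\u2e17'  # hyphen-minus, soft hyphen, Unicode hyphens, double oblique hyphen

def dehyphenate(text):
    return re.sub('[%s]\n' % HYPHENS, '', text)

assert dehyphenate('Rä-\ndelsführer in Lyon') == 'Rädelsführer in Lyon'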

bertsky commented 7 months ago

...and if inference mode does it this way, then training should explicitly mask all hyphens on the input side, too.

I am not even sure whether I should retain line breaks (newline character) as such.
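
For illustration, masking on the training side could amount to a preprocessing step like the following. This is an assumed sketch, not the current training code, with the newline question left as an option:

import re

# Assumed sketch of masking hyphenation on the training input side:
# remove line-end hyphen plus newline, and optionally flatten the
# remaining newlines to spaces.
def mask_hyphenation(text, keep_newlines=True):
    text = re.sub('[-\u00ad\u2e17]\n', '', text)  # drop line-end hyphenations
    if not keep_newlines:
        text = text.replace('\n', ' ')  # treat remaining line breaks as spaces
    return text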

bertsky commented 7 months ago

Note: meanwhile, I found out that the plaintext version of the DTA produced via dta-tools tei2txt has more problems for our use case:

Moreover, there are problems with the textual quality of the DTA extended set (Ergänzungstexte):

So I decided to do my own plaintext export from the TEI version, which solves all that – and takes another bold step: it now uses Unicode NFKD normalization, because precomposed characters are much harder to learn (being sparse), especially with such productive combinations as in polytonic Greek.
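
As a quick illustration of the effect (standard Python, not project code): NFKD splits a precomposed character into a base letter plus combining marks, so the model sees a few frequent codepoints instead of one rare one.

import unicodedata

# NFKD decomposes precomposed characters into base letter + combining marks;
# rare precomposed codepoints become sequences of frequent components.
for char in ('ä', 'ᾄ'):  # a with umlaut; alpha with psili, oxia and ypogegrammeni
    decomposed = unicodedata.normalize('NFKD', char)
    print('%s -> %s' % (char, ' '.join('U+%04X' % ord(c) for c in decomposed)))
# ä -> U+0061 U+0308
# ᾄ -> U+03B1 U+0313 U+0301 U+0345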

I will make some further code modifications to the training procedure (coverage of an explicit gap codepoint in the input, a cutoff frequency for implicit gaps in the input) and to the inference side (removal of normal line breaks and dehyphenation, applying NFKD and other string normalization, persisting the string preprocessor to the model config), and then retrain the DTA model.
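
Persisting the preprocessor could, for example, mean storing its settings next to the trained model so that inference reapplies exactly the same steps. The following is only a hypothetical sketch; the file name, keys and values are assumptions, not the actual model config format:

import json

# Hypothetical sketch: store the string-preprocessor settings next to the
# trained model so inference can reapply the same normalization.
preprocessor = {
    'unicode_normalization': 'NFKD',
    'dehyphenate': True,       # join line-end hyphenations
    'keep_newlines': False,    # whether to retain line breaks as such
    'gap_codepoint': '\u2423', # assumed placeholder codepoint for explicit gaps
}

with open('model_dta_full.preprocessor.json', 'w', encoding='utf-8') as f:
    json.dump(preprocessor, f, ensure_ascii=False, indent=2)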

Until then, the issue will stay open. If you have additional ideas, please comment.