Open jbarth-ubhd opened 7 months ago

Dear reader, does keraslm-rate take hyphenated words into account?

Using this demo file https://digi.ub.uni-heidelberg.de/diglitData/v/keraslm/test-fouche10,5-s1.pdf, it seems that many of the low-rated words have hyphens:

With hyphenation:
keras.csv     : 0.927003 0.468318 # region0002_line0029_word0006 Rä-

Without (manually removed) hyphenation:
kerasNOHYP.csv: 0.927003 0.573203 # region0002_line0029_word0006 Rädelsführer
No, hyphens are not treated specially in any way. If you are using model_dta_full.h5, that LM was trained on the Deutsches Textarchiv Kernkorpus und Ergänzungstexte plaintext edition, which does preserve the original type alignment (Zeilenfall, i.e. line breaks), so the model has "seen" hyphens and newlines. However, these texts are very diverse – some contain almost no hyphens, others make heavy use of them. So I am not sure how well the model has really learned to abstract over line breaks as a general possibility.
I have not specifically and methodically measured the impact of hyphenation myself, as you have. So thanks for the analysis!
In light of this, perhaps the model should indeed be applied with a dedicated rule: if there is a hyphen-like character at the end of a line, then dehyphenate, i.e. remove the hyphen together with the line break and join the word with its continuation on the next line before rating.
...and if inference mode does it this way, then training should explicitly mask all hyphens on the input side, too.
I am not even sure whether I should retain line breaks (newline character) as such.
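A minimal sketch of such a dedicated dehyphenation rule (not the actual keraslm-rate code; which characters count as hyphen-like and whether remaining newlines are kept are assumptions here):

```python
import re

# assumed set of hyphen-like characters: hyphen-minus, soft hyphen, hyphen,
# non-breaking hyphen, double oblique hyphen, not sign
HYPHENS = '-\u00ad\u2010\u2011\u2e17\u00ac'

def dehyphenate(text, keep_newlines=False):
    """Join words split across line breaks by a trailing hyphen-like character."""
    text = re.sub('[' + re.escape(HYPHENS) + ']\n', '', text)
    if not keep_newlines:
        # optionally treat the remaining line breaks as plain spaces
        text = text.replace('\n', ' ')
    return text

print(dehyphenate('Rä-\ndelsführer'))  # -> Rädelsführer
```

Applied before rating, the example above would be scored as a single word again; the same function could also be run over the training text to keep both sides consistent.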
Note: meanwhile, I found out that the plaintext version of DTA produced via dta-tools tei2txt has more problems for our use-case:

- tei:note (…) and tei:fw (…) should all be removed, but in past versions there was interference with line breaks
- tei:lb/@n (…), esp. in poems, used to be printed verbatim
- ¬ as hyphen sometimes (coverage of dehyphenation rule not 100%)

Moreover, there are problems with the textual quality of DTA extended set (Ergänzungstexte):

- [FORMEL] for tei:formula markup
- ⸗ despite transcription guidelines requiring standard hyphen-minus
- _ as gap character (which the training did not take into account until now)

So I decided to do my own plaintext export from the TEI version, which solves all that – and takes another bold step: it now uses Unicode NFKD normalization, because precomposed characters are much harder to learn (being sparse), esp. with such productive combinations as in polytonic Greek.
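To illustrate the NFKD effect with plain Python (independent of the model code): a precomposed character becomes a base letter followed by separate combining marks, so the network sees frequent building blocks instead of rare precomposed codepoints.

```python
import unicodedata

# 'ᾧ' (omega with psili, perispomeni and ypogegrammeni) decomposes into
# one base letter plus three combining marks under NFKD
for s in ['Rädelsführer', 'ᾧ']:
    decomposed = unicodedata.normalize('NFKD', s)
    print(s, '->', [hex(ord(c)) for c in decomposed])
```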
I will make some further code modifications to the training procedure (coverage of an explicit gap codepoint in the input, cutoff frequency for implicit gaps in the input) and to the inference side (removal of normal line breaks and dehyphenation, applying NFKD and other string normalization, persisting the string preprocessor to the model config), and then retrain the DTA model.
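A rough sketch of what persisting the string preprocessor to the model config could look like (file name and keys are only assumptions, not the final format):

```python
import json

# hypothetical sidecar file next to model_dta_full.h5, so that inference
# applies exactly the same string normalization as training
preprocessor = {
    'unicode_normalization': 'NFKD',
    'dehyphenate': True,
    'keep_newlines': False,
    'gap_codepoint': '_',  # explicit gap character covered in the input
}

with open('model_dta_full.preprocessor.json', 'w', encoding='utf-8') as f:
    json.dump(preprocessor, f, ensure_ascii=False, indent=2)
```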
Until then, the issue will stay open. If you have additional ideas, please comment.