jbarth-ubhd opened this issue 11 months ago
Which one of the two models is "better", and how did you compare them?
Comparison in the sense of "check whether the model files have the same content".
That's strange indeed. It's not to be expected from the vanilla tesstrain rules (even the fast variant just does ConvertToInt). And the concrete wordlist looks very awkward: it contains 400k full forms, nearly half of which are made of strange punctuation characters indicative of absent tokenisation, and the actual tokens are clearly scraped off the web, not historic at all. I would understand if the wordlist from `deu` or `frk` were used in `frak2021`, but that's not the case at all.
@stweil, can you explain?
`frak2021_1.069.traineddata` was made from the original training result, but with additional components like wordlist, number and punctuation hints (`frak2021_1.069.lstm-word-dawg`, `frak2021_1.069.lstm-number-dawg`, `frak2021_1.069.lstm-punc-dawg`). Those additional components are based on the components from a Tesseract standard model (as far as I remember on `Fraktur.traineddata`, but I'd have to check). Sort the word list before comparing it with other word lists.
Because of the additional components the file `frak2021_1.069.traineddata` is larger.
Typically, models with an (ideally domain-specific) wordlist can achieve slightly higher recognition rates, but sometimes it can also lead to OCR results which differ from the printed text.
And yes, this word list contains a lot of entries which should be removed. That's inherited from all standard Tesseract word lists.
> Those additional components are based on the components from a Tesseract standard model (as far as I remember on `Fraktur.traineddata`, but I'd have to check)

No, the latter word list is about twice the size, also with texts from the web, but it contains none of these strange words with punctuation (non-tokenised), and it does contain `ſ`, which yours does not.
Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must be accompanied by a thorough evaluation.)
Regardless, the word list in that model file looks exceptionally bad (much worse than the Tesseract word lists) and should be improved.
> Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must be accompanied by a thorough evaluation.)
I have now distilled a list of full forms, capping at different frequencies, respectively:
I filtered by part-of-speech, removing punctuation, numbers and non-words (XY):
```sql
select trim(u,'"') from csv
 where f > 100
   and p != "$(" and p != "$," and p != "$."
   and p != "FM.xy" and p != "CARD" and p != "XY";
```
Furthermore, I removed those entries which have not been properly tokenised (indicated by leading punctuation) or are merely numbers (but still do not get p=CARD):
```sh
grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$'
```
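For illustration, the two filters above (POS-based and regex-based) could also be expressed in pure Python. A hypothetical sketch — the column names and POS tags follow the SQL query above, and the sample rows are invented:

```python
import re

# Invented sample rows in the shape of the DTA lexdb export used above:
# (quoted fullform u, frequency f, POS tag p)
rows = [
    ('"Haus"', 150, "NN"),
    ('"!!"', 900, "$("),      # punctuation-only entry
    ('"1876"', 500, "XY"),    # a number that did not get p=CARD
    ('".Haus"', 200, "NN"),   # not properly tokenised (leading punctuation)
    ('"ſehen"', 120, "VVFIN"),
]

BAD_POS = {"$(", "$,", "$.", "FM.xy", "CARD", "XY"}

def keep(fullform, freq, pos, min_freq=100):
    word = fullform.strip('"')
    if freq <= min_freq or pos in BAD_POS:
        return False
    # same idea as the grep above: drop entries with leading punctuation
    # and digit/punctuation-only strings
    if re.match(r"\W", word) or re.fullmatch(r"[\d\W]*", word):
        return False
    return True

wordlist = [f.strip('"') for f, n, p in rows if keep(f, n, p)]
print(wordlist)  # ['Haus', 'ſehen']
```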
The quality is very good!
Maybe I'll also recompose the number and punc DAWGs for the additional historic patterns (e.g. `⸗` instead of hyphen, solidus instead of comma) and remove the contemporary ones (`€` sign etc.).
I will try to use this with frak2021, but also GT4HistOCR and others.
I guess I'll do some recognition experiments and evaluation before publishing the modified models.
10: 314248 words
50: 100516 words
100: 60403 words
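For the evaluation step, a character error rate (CER) can be computed with a plain Levenshtein distance. A minimal, self-contained sketch (not any particular tool's implementation; both sides are NFC-normalised first, following the usual convention):

```python
import unicodedata

def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance, computed row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(gt: str, ocr: str) -> float:
    gt = unicodedata.normalize("NFC", gt)
    ocr = unicodedata.normalize("NFC", ocr)
    return levenshtein(gt, ocr) / max(len(gt), 1)

# one substitution (ſ vs s) in 17 characters
print(cer("Weſen und Wirkung", "Wesen und Wirkung"))
```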
> I will try to use this with frak2021, but also GT4HistOCR and others.
Done: see
In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort. It would be more interesting to use it with german_print.
> In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort. It would be more interesting to use it with german_print.
Sure, that's why it's among the models I build the dict into – see the full list of assets.
Some evaluation (which material, which model, dict or not, and which cap frequency is preferable) will follow.
Here my small tool for checking the wordlists of .traineddata files:
https://gist.github.com/jbarth-ubhd/8d5ceb4035bf2d89700117a311209f20
@bertsky : but frak2021_dta10+100 do not contain »ſ«:
```
AMBIGIOUS (EXCERPT): 1sten A/ AP. As. AZ. Basalt- Bauers- Besitz- Bietsch- c. cas. Centralbl. Chrysost. cl. Corn. dial. Diener. Ding- Dinge. Ebd. Eigen- eigentl. Eisen. euch. Eurip. fgm. FML. fundam. g1 Gebiets- Geitz- Generals- G.n GOtts Griseb. Haubt- haus- HErre hsg. inst. Jahrbb. Jungfrau- k. Kg. Kiefer- Lactant. lap. legit. Loose. Magdalenen- Mai- Mehl= Namen. nat. neu- NJmb Normal- O1 Pall. pan. Pfand- Pfl. proc. Reb- redet. Rev. Rhodigin. Rich. Roman- Sc. Schulen. Schweine- Sed. SEin SJndt Spargel- Spitz- Strom. Syllog. Trauben- Trav. Trias- Trift- VIEUSSENS. VVilliam Wach- W.-B. wohl- Wolf. XCVII. y2 Ztg. zwei-
264677 lines
0.00 % lines with »ſ«
0.64 % lines all-UPPERCASE
3.51 % lines ambigious
```
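The linked gist is the authoritative tool; the three statistics could be reproduced roughly along these lines (a simplified sketch with an invented "ambiguous" heuristic, not the actual checker):

```python
import unicodedata

def wordlist_stats(lines):
    n = len(lines)
    with_s = sum("ſ" in w for w in lines)
    upper = sum(w.isupper() for w in lines)

    def ambiguous(w):
        # invented heuristic: abbreviations/compound fragments
        # (trailing . - =) or entries that are not NFC-normalised
        return w.endswith((".", "-", "=")) or unicodedata.normalize("NFC", w) != w

    amb = sum(ambiguous(w) for w in lines)
    return {"ſ": 100.0 * with_s / n,
            "UPPER": 100.0 * upper / n,
            "ambiguous": 100.0 * amb / n}

# "wen\u0303" is 'wen' + COMBINING TILDE, i.e. not NFC
stats = wordlist_stats(["Haus", "ſehen", "Ztg.", "wen\u0303", "XCVII."])
print(stats)  # {'ſ': 20.0, 'UPPER': 20.0, 'ambiguous': 60.0}
```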
Indeed – something went wrong. Thanks @jbarth-ubhd, I'll investigate!
Ok, I found the problem. See new release.
```
346632 lines
16.37 % lines with »ſ«
0.19 % lines all-UPPERCASE
132.80 % lines ambigious
```
What's with the > 100% BTW?
>100 % is because I inspected only roughly every 1/0.003th word (to keep the output compact) and multiplied the count back up – I'll have a look at this.
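For what it's worth, a sampled percentage stays bounded by 100 if each sampled entry is counted at most once and the division uses the sampled size rather than a scaled-up raw count. A sketch with hypothetical data:

```python
import random

def estimate_percentage(items, predicate, sample_rate=0.1, seed=42):
    # Sample roughly sample_rate of the items, count each hit at most once,
    # and divide by the number of items actually sampled. Scaling a raw hit
    # count by 1/sample_rate and dividing by the full list size instead
    # double-counts as soon as one word trips two checks, which is how a
    # percentage can end up above 100.
    rng = random.Random(seed)
    sample = [w for w in items if rng.random() < sample_rate]
    if not sample:
        return 0.0
    hits = sum(1 for w in sample if predicate(w))
    return 100.0 * hits / len(sample)

# hypothetical wordlist: 20% of the entries carry a combining tilde
words = ["wen\u0303"] * 2000 + ["Haus"] * 8000
p = estimate_percentage(words, lambda w: "\u0303" in w)
print(p)  # roughly 20, and by construction never above 100
```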
Just inspected frak2021_dta50.traineddata.
Ambiguous:
welchẽ (not NFC) Welchs weñ (not NFC) Weñ (not NFC) wenigen
A lot of spaces after the words(?). And the "not NFC" entries are counted twice – my bug.
The spaces were not in the frak2021_dta10/100 files I had downloaded up to Jan 30, 11:55.
now with much nicer output:
welchẽ␣␣␣(not NFC) Welchs␣␣␣␣ weñ␣␣␣␣␣␣(not NFC) Weñ␣␣␣␣␣␣(not NFC) wenigen␣␣␣
a lot of spaces after words(?).
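Trailing spaces in wordlist entries are easy to detect and strip mechanically; a minimal sketch:

```python
def find_trailing_whitespace(lines):
    """Return (line number, entry) pairs for entries with trailing whitespace."""
    return [(i, w) for i, w in enumerate(lines, 1) if w != w.rstrip()]

entries = ["welchẽ   ", "Welchs    ", "wenigen"]
bad = find_trailing_whitespace(entries)
print([i for i, _ in bad])   # [1, 2]

cleaned = [w.rstrip() for w in entries]
print(cleaned)               # ['welchẽ', 'Welchs', 'wenigen']
```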
wow, I should have checked. Thanks again for being thorough @jbarth-ubhd – much appreciated!
see new release
> And not NFC (double counting, my bug.)
Do we really want that? (Even if DTA decided not to do it?)
Whether we want NFC? I don't know. I added the check only because otherwise I wouldn't notice this easily. I can remove it.
> Whether we want NFC? I don't know. I added the check only because otherwise I wouldn't notice this easily. I can remove it.
I just checked: tesstrain does NFC on the input GT (via `unicodedata.normalize` in `generate_line_box.py`). And calamari-train does so by default. Kraken's `ketos train` offers it, but it does not seem to be the default.
It is also used in most CER measurement tools.
I feel obliged to comply with this obvious convention in the OCR space.
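What NFC actually does to the characters in question can be checked directly with the standard library:

```python
import unicodedata

decomposed = "wen\u0303"                        # 'n' + U+0303 COMBINING TILDE
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))           # 4 3
print(composed == "we\u00f1")                   # True: single code point ñ

# NFC leaves the long s alone: U+017F is a distinct letter, not a
# composed sequence; only the compatibility mapping NFKC folds it to 's'
print(unicodedata.normalize("NFC", "ſ"))        # ſ
print(unicodedata.normalize("NFKC", "ſ"))       # s
```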
Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters with diacritics.
And I already have a Tesseract branch which no longer requires box and lstmf files for the training.
> Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters with diacritics.
Right, but you can choose: `--norm_mode` (or `NORM_MODE` in tesstrain) sets the normalization mode.
And it's configured differently for various mother tongues.
So my fixed NFC in the DTA LM was premature, is what you are saying, @stweil?
No, my comment was just meant as an information for you.
Comparison frak2021 … _dta50:
```
4160da1e088452fcec11df5a411d9a91  /usr/local/ocrd-models/ocrd-tesserocr-recognize/frak2021_dta50.traineddata
234e8bb819042f615576bd01aa2419fd  /usr/local/ocrd-models/ocrd-tesserocr-recognize/frak2021.traineddata
```
With ..._dta50 some punctuation is missing, but there is almost no word diff ... I'd expected the dictionary to have a greater impact.
So it looks like using a dictionary makes recognition of punctuation worse (unless the dictionary also contains the words with the punctuation)? That's not the kind of impact which is desired.
> With ..._dta50 some punctuation is missing, but there is almost no word diff ... I'd expected the dictionary to have a greater impact.
Me too. But the averages do go down overall (if only a little) in my experiments.
I did not fiddle with `WORD_DAWG_FACTOR` yet.
> So it looks like using a dictionary makes recognition of punctuation worse (unless the dictionary also contains the words with the punctuation)?
It would appear so. But there may be a general problem with re-integrating the punctuation DAWG. I am also still trying to modify it in a way that covers extra punctuation characters like `⸗` and `—` and `–`. The problem is that Tesseract does not have code to de/serialise it from/to anything other than binary form. (I would have expected at least one of the old automaton text formats like AT&T's. Unclear how these FSTs came to be in the first place. Manually?)
I've compared these frak models:
- ocrd: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata (from ocrd resmgr)
- ubma: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069.traineddata (from https://ocr-bw.bib.uni-mannheim.de/faq/)
size & md5sum:
Content after `combine_tessdata -u x.traineddata aa`: ubma is with `.lstm-word-dawg`, ocrd is without.
The ocrd lstm component is 3.3M, the ubma lstm is 432k.
Shouldn't ocrd use the ubma file for Fraktur/Gothic?