OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules

frak models in ocrd resmgr #404

Open jbarth-ubhd opened 8 months ago

jbarth-ubhd commented 8 months ago

I've compared these frak models:

ocrd: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata from ocrd resmgr

ubma: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069.traineddata from https://ocr-bw.bib.uni-mannheim.de/faq/

size & md5sum:

-rw-rw-r-- 1 jb jb 3421140 Mar 27  2021 ocrd--frak2021-0.905.traineddata
234e8bb819042f615576bd01aa2419fd  ocrd--frak2021-0.905.traineddata
-rw-rw-r-- 1 jb jb 5060763 Dec  9  2021 ubma--frak2021_1.069.traineddata
9405b1603db21cb066e4e7614a405dd4  ubma--frak2021_1.069.traineddata

Contents after combine_tessdata -u x.traineddata aa:

jb@nuc:~/models$ LC_ALL=C ls -lh ocrd ubma
ocrd:
total 3.3M
-rw-rw-r-- 1 jb jb 3.3M Dec 21 12:18 aa.lstm
-rw-rw-r-- 1 jb jb 2.8K Dec 21 12:18 aa.lstm-recoder
-rw-rw-r-- 1 jb jb  22K Dec 21 12:18 aa.lstm-unicharset
-rw-rw-r-- 1 jb jb   30 Dec 21 12:18 aa.version
-rw-rw-r-- 1 jb jb  345 Dec 21 12:18 extr.log

ubma:
total 4.9M
-rw-rw-r-- 1 jb jb 432K Dec 21 12:18 aa.lstm
-rw-rw-r-- 1 jb jb 6.3K Dec 21 12:18 aa.lstm-number-dawg
-rw-rw-r-- 1 jb jb 4.5K Dec 21 12:18 aa.lstm-punc-dawg
-rw-rw-r-- 1 jb jb 2.8K Dec 21 12:18 aa.lstm-recoder
-rw-rw-r-- 1 jb jb  22K Dec 21 12:18 aa.lstm-unicharset
-rw-rw-r-- 1 jb jb 4.4M Dec 21 12:18 aa.lstm-word-dawg
-rw-rw-r-- 1 jb jb   30 Dec 21 12:18 aa.version
-rw-rw-r-- 1 jb jb  553 Dec 21 12:18 extr.log

The ubma model includes an .lstm-word-dawg; the ocrd model does not.

The ocrd .lstm is 3.3M (from tessdata_best), while the ubma .lstm is only 432K (from tessdata_fast).

Shouldn't ocrd use the ubma file for Fraktur/Gothic?

stweil commented 7 months ago

Which one of the two models is "better", and how did you compare them?

jbarth-ubhd commented 7 months ago

Comparison in the sense of checking whether the model files have the same content.

bertsky commented 7 months ago

That's strange indeed. It's not to be expected from the vanilla tesstrain rules (even the fast variant just does ConvertToInt). And the concrete wordlist looks very awkward: it contains 400k full forms, nearly half of which are made up of strange punctuation characters (indicative of absent tokenisation), and the actual tokens are clearly scraped off the web, not historic at all. I would understand if the wordlist from deu or frk were used in frak2021, but that's not the case at all.

@stweil can you explain?

stweil commented 7 months ago

frak2021_1.069.traineddata was made from the original training result, but with additional components like a wordlist plus number and punctuation hints (frak2021_1.069.lstm-word-dawg, frak2021_1.069.lstm-number-dawg, frak2021_1.069.lstm-punc-dawg). Those additional components are based on the components of a standard Tesseract model (Fraktur.traineddata, as far as I remember, but I'd have to check). Sort the word list before comparing it with other word lists.

Because of these additional components, the file frak2021_1.069.traineddata is larger.

Typically, models with an (ideally domain-specific) wordlist can achieve slightly higher recognition rates, but sometimes a wordlist can also lead to OCR results which differ from the printed text.

And yes, this word list contains a lot of entries which should be removed. That is inherited from all the standard Tesseract word lists.
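
(For reference, a minimal sketch of how the embedded word lists can be extracted, sorted and compared; it assumes the standard Tesseract training tools combine_tessdata and dawg2wordlist, and the file names are illustrative:)

# unpack both models into component files
combine_tessdata -u frak2021_1.069.traineddata ubma.
combine_tessdata -u Fraktur.traineddata fraktur.
# convert the word DAWGs back into plain word lists
dawg2wordlist ubma.lstm-unicharset ubma.lstm-word-dawg ubma-words.txt
dawg2wordlist fraktur.lstm-unicharset fraktur.lstm-word-dawg fraktur-words.txt
# sort before diffing, as suggested above
sort ubma-words.txt > ubma-sorted.txt
sort fraktur-words.txt > fraktur-sorted.txt
diff ubma-sorted.txt fraktur-sorted.txt | less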

bertsky commented 7 months ago

Those additional components are based on the components of a standard Tesseract model (Fraktur.traineddata, as far as I remember, but I'd have to check)

No, the latter word list is about twice the size, also with texts from the web, but contains none of these strange words with punctuation (non-tokenised), and does contain ſ, which yours does not.

Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must be accompanied by a thorough evaluation.)

Regardless, the word list in that model file looks exceptionally bad (much worse than the Tesseract word lists) and should be improved.

bertsky commented 7 months ago

Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must be accompanied by a thorough evaluation.)

I have now distilled a list of full forms, capping at different minimum frequencies, respectively:

  • 10: 314248 words

  • 50: 100516 words

  • 100: 60403 words

I filtered by part-of-speech, removing punctuation, numbers and non-words (XY):

select trim(u,'"') from csv where f > 100 and p != "$(" and p != "$," and p != "$." and p != "FM.xy" and p != "CARD" and p != "XY";

Furthermore, I removed those entries which have not been properly tokenised (indicated by leading punctuation) or are merely numbers (but still do not get p=CARD):

grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$'
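
(Chained together, the whole distillation might look like this; a sketch where lexdb.sqlite stands for the DTA lexdb imported into an SQLite table csv with the columns u/f/p used above, filter.sql holds the select statement quoted above, and dta_fullforms.txt is an arbitrary output name:)

sqlite3 lexdb.sqlite < filter.sql \
  | grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$' \
  > dta_fullforms.txt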

The quality is very good!

Maybe I'll also recompose the number and punc DAWGs for the additional historic patterns (e.g. ⸗ instead of hyphen, solidus instead of comma) and remove the contemporary ones (the € sign etc.).

I will try to use this with frak2021, but also GT4HistOCR and others.

I guess I'll do some recognition experiments and evaluation before publishing the modified models.
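
(The general recipe for building such a list into a model would be roughly as follows; a sketch using the standard Tesseract training tools, where the frak2021.* component files are assumed to come from a prior combine_tessdata -u run and dta_fullforms.txt is the filtered list from above:)

# rebuild the word DAWG from the filtered word list
wordlist2dawg dta_fullforms.txt frak2021.lstm-word-dawg frak2021.lstm-unicharset
# repack all frak2021.* components into a new traineddata file
combine_tessdata frak2021.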

bertsky commented 7 months ago

I have now distilled a list of full forms, capping at different minimum frequencies, respectively:

  • 10: 314248 words

  • 50: 100516 words

  • 100: 60403 words

I will try to use this with frak2021, but also GT4HistOCR and others.

Done: see

stweil commented 7 months ago

In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort. It would be more interesting to use it with german_print.

bertsky commented 7 months ago

In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort. It would be more interesting to use it with german_print.

Sure, that's why it's among the models I build the dict into – see the full list of assets

Some evaluation (which material, which model, whether with dict or not, and which frequency cap is preferable) will follow.

jbarth-ubhd commented 7 months ago

Here is my small tool for checking the wordlists of .traineddata files:

https://gist.github.com/jbarth-ubhd/8d5ceb4035bf2d89700117a311209f20
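
(The gist is authoritative; in outline, the check boils down to something like this, again assuming combine_tessdata and dawg2wordlist:)

combine_tessdata -u frak2021_dta50.traineddata m.
dawg2wordlist m.lstm-unicharset m.lstm-word-dawg words.txt
wc -l < words.txt                      # total lines
grep -c 'ſ' words.txt                  # lines with »ſ«
grep -c -E '^[[:upper:]]+$' words.txt  # lines all-UPPERCASE (locale-dependent)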

jbarth-ubhd commented 7 months ago

@bertsky: but frak2021_dta10+100 do not contain »ſ«:

AMBIGUOUS (EXCERPT): 1sten A/ AP. As. AZ. Basalt- Bauers- Besitz- Bietsch- c. cas. Centralbl. Chrysost. cl. Corn. dial. Diener. Ding- Dinge. Ebd. Eigen- eigentl. Eisen. euch. Eurip. fgm. FML. fundam. g1 Gebiets- Geitz- Generals- G.n GOtts Griseb. Haubt- haus- HErre hsg. inst. Jahrbb. Jungfrau- k. Kg. Kiefer- Lactant. lap. legit. Loose. Magdalenen- Mai- Mehl= Namen. nat. neu- NJmb Normal- O1 Pall. pan. Pfand- Pfl. proc. Reb- redet. Rev. Rhodigin. Rich. Roman- Sc. Schulen. Schweine- Sed. SEin SJndt Spargel- Spitz- Strom. Syllog. Trauben- Trav. Trias- Trift- VIEUSSENS. VVilliam Wach- W.-B. wohl- Wolf. XCVII. y2 Ztg. zwei-

264677 lines
0.00 % lines with »ſ«
0.64 % lines all-UPPERCASE
3.51 % lines ambiguous

bertsky commented 7 months ago

Indeed – something went wrong. Thanks @jbarth-ubhd, I'll investigate!

bertsky commented 7 months ago

Ok, I found the problem. See new release.

346632 lines
16.37 % lines with »ſ«
0.19 % lines all-UPPERCASE
132.80 % lines ambiguous

What's with the > 100% BTW?

jbarth-ubhd commented 7 months ago

The >100% is because I inspect only every 1/0.003th (i.e. roughly every 333rd) word, to keep the output compact, and then multiply the count back up - I'll have a look at this.

jbarth-ubhd commented 7 months ago

Just inspected frak2021_dta50.traineddata.

Ambiguous:

welchẽ   (not NFC) Welchs     weñ      (not NFC) Weñ      (not NFC) wenigen

A lot of trailing spaces after words(?). And entries that are not NFC (the double counting is my bug).

The spaces were not in the frak2021_dta10/100 files I downloaded up until Jan 30 11:55.

jbarth-ubhd commented 7 months ago

now with much nicer output:

welchẽ␣␣␣(not NFC) Welchs␣␣␣␣ weñ␣␣␣␣␣␣(not NFC) Weñ␣␣␣␣␣␣(not NFC) wenigen␣␣␣

bertsky commented 7 months ago

A lot of trailing spaces after words(?).

wow, I should have checked. Thanks again for being thorough @jbarth-ubhd – much appreciated!

see new release

And entries that are not NFC (the double counting is my bug).

Do we really want that? (Even if DTA decided not to do it?)

jbarth-ubhd commented 7 months ago

If we want NFC? I don't know. I inserted it just because otherwise I wouldn't easily notice this. I can remove the check.

bertsky commented 7 months ago

If we want NFC? I don't know. I inserted it just because otherwise I wouldn't easily notice this. I can remove the check.

I just checked: tesstrain does NFC on the input GT (via unicodedata.normalize in generate_line_box.py). And calamari-train does so by default. Kraken's ketos train offers it, but it does not seem to be the default.

It is also used in most CER measurement tools.

I feel obliged to comply with this obvious convention in the OCR space.
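
(For illustration, non-NFC entries in a word list can be counted from the shell; a sketch assuming ICU's uconv is installed:)

# normalise to NFC and count the lines that change
uconv -f utf-8 -t utf-8 -x any-nfc words.txt > words.nfc.txt
diff words.txt words.nfc.txt | grep -c '^<'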

bertsky commented 7 months ago

There we go

stweil commented 7 months ago

Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters with diacritics.

And I already have a Tesseract branch which no longer requires box and lstmf files for the training.

bertsky commented 7 months ago

Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters with diacritics.

Right, but you can choose: --norm_mode (or NORM_MODE in tesstrain) sets the normalization mode:

  1. Combine graphemes
  2. Split graphemes
  3. Pure unicode

And it's configured differently for various mother tongues.

So you are saying that my fixed NFC in the DTA LM was premature, @stweil?
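
(For context, a rough sketch of where the mode enters a vanilla tesstrain run; the model name and paths are placeholders:)

# NORM_MODE is forwarded by the tesstrain Makefile ...
make training MODEL_NAME=frak2021 NORM_MODE=2
# ... to unicharset_extractor, roughly as:
unicharset_extractor --output_unicharset data/frak2021/my.unicharset \
  --norm_mode 2 data/frak2021/all-gt.txt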

stweil commented 7 months ago

No, my comment was just meant as information for you.

jbarth-ubhd commented 7 months ago

Comparison frak2021 … _dta50:

4160da1e088452fcec11df5a411d9a91 /usr/local/ocrd-models/ocrd-tesserocr-recognize/frak2021_dta50.traineddata

234e8bb819042f615576bd01aa2419fd /usr/local/ocrd-models/ocrd-tesserocr-recognize/frak2021.traineddata

[image: comparison of recognition results]

jbarth-ubhd commented 7 months ago

With ..._dta50 some punctuation is missing, but there is almost no word diff ... I'd expected the dictionary to have a greater impact.

stweil commented 6 months ago

So it looks like using a dictionary makes recognition of punctuation worse (unless the dictionary also contains the words with the punctuation)? That's not the kind of impact which is desired.

bertsky commented 6 months ago

With ..._dta50 some punctuation is missing, but there is almost no word diff ... I'd expected the dictionary to have a greater impact.

me too. But the averages do go down overall (if just a little) in my experiments.

I did not fiddle with WORD_DAWG_FACTOR yet.

So it looks like using a dictionary makes recognition of punctuation worse (unless the dictionary also contains the words with the punctuation)?

It would appear so. But there may be a general problem with re-integrating the punctuation DAWG. I am also still trying to modify it in a way that covers extra punctuation characters like the historic ⸗ and solidus mentioned above. The problem is that Tesseract does not have code to de/serialise it from/to anything other than binary form. (I would have expected at least one of the old automaton text formats like AT&T's. It is unclear how these FSTs came to be in the first place. Manually?)