The Fraktur data files that I trained all use a regular s for the long s.
That's Danish, Swedish and German. Basically, that's because I have no use for
long s'es, and if I had distinguished between the two types of s, I would have
had to run a replacement filter afterwards converting long s'es to regular
s'es. Extra work for no gain. I also think this is the common usage scenario
(that is, converting texts to their Antiqua counterparts, since that's all
people read nowadays). So I won't be the one putting in the legwork to create
training data with long s'es. You are, of course, welcome to do this yourself,
if you are up to it.
If you do look into this yourself, you can start with the files I used, which
are available at
https://github.com/paalberti/tesseract-dan-fraktur
The dictionary is just a small part of it, though. You will have to edit all
the tif/box files. There is some guidance on editing this type of file in the
wiki pages. And I should note that my files have been updated to work on
Tesseract version 3.01 but not yet for version 3.02. That's the next thing
I've got planned for them, but there always seems to be something more
important to do, so I won't get around to it right away.
Just out of curiosity, what do you need the long s for?
Original comment by dsl602...@vip.cybercity.dk
on 12 Apr 2012 at 3:37
There are repercussions of this ſ/s simplification that decrease recognition
accuracy, IMHO.
There is the issue of training, where you try to squeeze two very different
patterns (ſ/s) into one slot, presumably reducing recognition accuracy.
Examples where I think this happens are words which differ only by ſ/f, like
Luſt/Luft. I suspect misrecognition happens in both directions here, and an
empirical look at OCR'ed text is a first confirmation.
Much more important, however, are those words which differ only in f/s but
where there exists no ſ-word. The most notorious example is the very frequent
auf/aus error. If there were a ſ-dictionary, and Tesseract could see that
'auf' exists but 'auſ' does not, it could decide much better for or against
'auf' in the case of a dubious f/ſ. I am not a Tesseract developer, so I hope
I'm not only handwaving here. I base the argument on the observation that the
dictionary matters during recognition: often, with a bad image, dictionary
words are still recognized, so a ſ-dictionary would matter in this case.
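To make this concrete, here is a toy sketch of how a ſ-aware wordlist could resolve a dubious f/ſ. This is my own illustration, not Tesseract's actual dictionary machinery, and the wordlist is a made-up stand-in:

```python
# Toy disambiguation: when the recognizer is unsure between f and ſ,
# keep only the candidate words that exist in a ſ-aware wordlist.
WORDLIST = {"auf", "aus", "Luft", "Lu\u017ft"}  # "Lu\u017ft" is Luſt

def disambiguate(prefix: str, suffix: str) -> list:
    """Return the candidates (prefix + 'f'/'ſ' + suffix) found in the wordlist."""
    candidates = [prefix + ch + suffix for ch in ("f", "\u017f")]
    return [w for w in candidates if w in WORDLIST]

print(disambiguate("au", ""))   # 'auf' exists, 'auſ' does not -> ['auf']
print(disambiguate("Lu", "t"))  # both spellings exist -> ['Luft', 'Luſt']
```

With only 'auf' surviving, the engine could prefer it over the nonexistent 'auſ'; where both spellings exist (Luſt/Luft), the dictionary alone cannot decide and the image evidence has to break the tie.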
On the other hand, you have bulk-converting ſ to s, which is easily and
quickly done. I would never trade a possible gain in recognition for such a
programming one-liner.
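For illustration, that one-liner could look like this in Python (the sample string is invented):

```python
# Bulk-convert long s (U+017F) to round s after OCR.
ocr_text = "Die Lu\u017ft i\u017ft gro\u017fz."   # sample recognizer output with ſ
modern_text = ocr_text.replace("\u017f", "s")
print(modern_text)                                 # -> Die Lust ist grosz.
```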
Original comment by gtrw...@gmail.com
on 12 Apr 2012 at 5:08
You may have a valid point with the aus/auſ/auf example. I am certainly not a
Tesseract developer, so no comment from me on the other issue you mention.
But still, if you'd like to see this changed any time soon, your best chance is
to change it yourself. There are so many things that ought to be tweaked that
this is not at the top of my list, and it's been some time since I've found the
time and energy to look at any of them.
It is quite likely that you can make this work better than I could, so I hope
you do try.
Original comment by dsl602...@vip.cybercity.dk
on 14 Apr 2012 at 8:15
I hear you. In case I can help make this possible, I will add a third reason to
differentiate the dictionary in the German Fraktur version:
Most Fraktur books in the German language were printed before 1901, when the
second "orthographic conference" took place. I only have a German link for this:
http://de.wikipedia.org/wiki/Orthographische_Konferenz_von_1901
Even earlier, in 1876, there was another reform; see
http://de.wikipedia.org/wiki/I._Orthographische_Konferenz
These led to marked changes in the dictionary, like sz -> ß, some ss -> ß,
many th -> t, and c -> z. Quite a few changes cannot be described by simple rules.
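As a sketch of that last point: the regular part of these reforms could be tabulated, but a real pre-1901 normalization would need a wordlist, since the rules have many exceptions. The word pairs below are my own illustrative examples, not taken from any actual package:

```python
# Illustrative (and deliberately incomplete) pre-1901 -> modern spelling pairs.
# A lookup table like this is a stand-in for a proper old-German wordlist;
# blanket rules such as "th -> t" would mangle loanwords and names.
OLD_TO_MODERN = {
    "Th\u00fcr": "T\u00fcr",      # th -> t (Thür -> Tür)
    "Thal": "Tal",
    "Cigarre": "Zigarre",         # c -> z
    "Grusz": "Gru\u00df",         # sz -> ß
}

def modernize(word: str) -> str:
    """Map a pre-1901 spelling to its modern form, if we know it."""
    return OLD_TO_MODERN.get(word, word)

print(modernize("Thal"))   # -> Tal
print(modernize("Haus"))   # unknown to the table, unchanged -> Haus
```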
The package, as it is now, applies a modern dictionary, which works well with
Antiqua fonts, to Fraktur texts printed before 1901, leading to two results:
1. correct words are not found in the dictionary, so they must be recognized
letter by letter each time, thus increasing the error rate; 2. with bad
images, wrong dictionary words are picked out of the noise by the comparison
with the dictionary.
So what is additionally needed is a version with a pre-1901 dictionary.
This is well known, but no one bothered because ABBYY provided a workable
solution with a previous version of their engine. Now that this has been
discontinued, people face an expensive web service and are beginning to
lament. I'm glad I could at least verbalize these problems here.
Original comment by gtrw...@gmail.com
on 15 Apr 2012 at 5:59
I have Windows XP, and I would like to make my own Fraktur training data
file(s). Are they tailored to a font on the computer, or are they trained on
variations of Fraktur fonts from different texts?
Is it a simple process to train Fraktur fonts from different texts using
Windows XP and Tesseract 3?
If possible, could I have step-by-step instructions on how to make Fraktur
training data files which give greater accuracy than the Fraktur training
data kindly supplied?
Original comment by nine.ele...@gmail.com
on 7 Jun 2012 at 9:37
@dsl602230:
What about creating an alternative language pack in which 's' is recognized as
's', 'ſ' as 'ſ', and 'f' as 'f'?
Original comment by zde...@gmail.com
on 12 Jan 2014 at 2:52
Thanks for the explanation of why it might be useful to retain the ſ/s
distinction until later in the recognition process, or even all the way to the
output. These are valid points, but we won't be able to address this issue for
3.04.
The main difficulty is lack of wordlist training data for building the
dictionaries that relate to older variants of German (and other languages).
With 3.04 the wordlists that we use to build the dictionaries will be opened up
and provided in the git repository. If you have access to a wordlist for older
German (perhaps you are an academic working in that field for instance) that
has suitable licensing, it would be easy to add a parallel language directory
for older German and train using that data.
Original comment by theraysm...@gmail.com
on 4 Nov 2014 at 6:46
Original issue reported on code.google.com by
gtrw...@gmail.com
on 12 Apr 2012 at 2:08