Closed kba closed 5 years ago
IINM, there might be multiple use cases with different goals:
Since local features will be most useful anyway, and since everything (including global features) can be accommodated by PAGE, I suggest keeping METS completely out here. PAGE's `TextStyle` element offers among others:

- `@fontFamily`, where we are free to define strings to our liking (including a string serialization of a feature combination)
- `@monospace` and `@serif`
- `@bold`, `@italic`, `@letterSpaced`

All of which would be most useful if resolved down to the `Word` level. Is that even possible with reasonable accuracy? Then there are of course `@primaryScript` and `@secondaryScript` (e.g. `Latn` vs `Latf`), which probably make more sense on the `TextRegion` or `Page` level.
Font family might have to be more fine-grained and more locally resolved for OCR than for other use cases. (Certainly something between blackletter vs antiqua and font equivalence classes of the X font system, but possibly also including character set information like "including ſ" or "excluding ß".) So maybe there is more to this than sharply discernible clusters: font classification based on OCR error rate? @stweil @noahmetzger
Local resolution (how sharp the boundaries of those font features can become in the text sequence) is probably a two-sided problem: it is difficult to get high precision at high resolution, and it is also difficult to annotate at high resolution without impairing OCR accuracy (because recognition might want to decide on `Word` and `Glyph` coordinates on its own – see #77 – or even not enforce word segmentation at all, see #72).
And maybe formatting features should be allowed to carry confidence values, too?
@kba, if you want to describe how text is transcribed in the OCR results, then "it uses long s" alone is not sufficient. Tesseract, for example, also outputs special UTF-8 characters for some ligatures like ae, oe, ff, st and more.
@stweil, I think this is more about prescribing allowed character sets. If we know from metadata or the font classifier component that the whole text can or cannot contain `ſ`, `ß` etc., then we would like OCR to know as well. Moreover, IIRC the spec says we must not produce consonant ligatures like `ff` or `st` (GT level 2).
Maybe for Tesseract all this is doable with `tessedit_char_blacklist` and `tessedit_char_whitelist` at runtime?
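On the command line that would look something like the following sketch (the `-c` flag and both variable names are real Tesseract parameters; the file names are placeholders):

```shell
# Exclude long s and eszett from recognition:
tesseract page.png page -c tessedit_char_blacklist='ſß'
# Or restrict recognition to an explicit character set:
tesseract page.png page -c tessedit_char_whitelist='abcdefghijklmnopqrstuvwxyz'
```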
@stweil True, these are just examples to start a discussion. At this point, I'm most interested in what types of font information needs there are and how we model them, so we can produce matching ground truth data ASAP.
@bertsky Thanks for the feedback.
@VChristlein @seuretm You mentioned in the call that you've been identifying some 8 clusters of types. Can you explain the features that they share?
Off the top of my head, I am at the moment thinking of misusing `@fontFamily` to contain space-separated (think HTML `@class`) font names and font feature sets from a controlled vocabulary, possibly with a suffixed confidence value (like `fontFamily="blackletter schwabacher:0.7 textualis:0.3"`), and possibly a convention to use the boolean attributes like `bold`, `italic` only if the confidence is 1, falling back to further misusing `fontFamily` to include them there with confidence values.
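A quick sketch of how a consumer could parse such an attribute value (the serialization format is only the proposal above, not an existing spec, and the function name is made up):

```python
def parse_font_family(value):
    """Parse a space-separated fontFamily value where each token is
    either a plain name (implicit confidence 1.0) or 'name:conf'."""
    scores = {}
    for token in value.split():
        if ":" in token:
            name, conf = token.rsplit(":", 1)
            scores[name] = float(conf)
        else:
            scores[token] = 1.0
    return scores

scores = parse_font_family("blackletter schwabacher:0.7 textualis:0.3")
best = max(scores, key=scores.get)  # highest-confidence entry
```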
@VChristlein @seuretm We need your input here. Would you prefer to discuss this online (i.e. VC) or could you leave your thoughts here?
@noahmetzger Could you pls. check a) whether `tessedit_char_blacklist` and `tessedit_char_whitelist` are working with LSTM models and b) if they are exposed via `tesserocr`? Thanks!
Hi, @seuretm is currently working on a font group recognition system. From an accuracy point of view, the more text we have, the more accurate it becomes (provided there are not multiple fonts). As far as I know, @seuretm currently works on a per-patch basis; while the network is not yet fully convolutional, it could easily be converted and then used at bigger resolutions (currently many random patches are just averaged), i.e. full images/paragraphs. Lines should work too (if padded accordingly); I guess words are too small to be recognized accurately enough. Thus, it would make sense if the information could be put anywhere in PAGE, but for a start I'd only consider full pages (or?). Mathias also wanted to start the integration into OCR-D now and might need some help here; I'll tell him to follow this discussion.
@wrznr, I'm afraid that black list / white list currently require a legacy model, see Tesseract wiki. Perhaps @noahmetzger can have a look whether this feature can be added for LSTM models, too.
Regarding blacklist and whitelist: if it is too much of an effort, we could also delegate this to our API wrapper `ocrd_tesserocr`, which would then have two options, `GetChoiceIterator` or `GetGlyphConfidences`, to explicitly exclude certain characters. The latter would be a fall-back option if no such models exist / are installed. Something along those lines could surely be done for other OCRs as well.
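The fall-back filtering could be sketched like this, assuming the wrapper has already collected the per-glyph alternatives as (character, confidence) pairs (e.g. from the choice iterator); `best_allowed` is a hypothetical helper, not part of any existing API:

```python
def best_allowed(choices, blacklist):
    """Pick the highest-confidence alternative for one glyph position,
    skipping any character in the blacklist.
    choices: list of (character, confidence) pairs; returns None if
    every alternative is blacklisted."""
    allowed = [(ch, conf) for ch, conf in choices if ch not in blacklist]
    return max(allowed, key=lambda c: c[1]) if allowed else None

best_allowed([("ß", 0.8), ("B", 0.15)], {"ß"})  # ('B', 0.15)
```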
Hello everybody,
Regarding the number of fonts (or type groups), we are currently using images selected by Saskia which have 8 different types (Antiqua, Bastarda, Fraktur, Greek, Hebrew, Kursiv, Rotunda, and Textura). Training a CNN on the 15th-century pages with some data augmentation yields a classification accuracy of roughly 95%. This approach, however, does not appear to cluster together pages from unseen scripts (the later ones) when visualizing the outputs of the last 2 layers of the CNN. This means automatic clustering of unseen fonts using this CNN is likely to fail. To solve this issue, I am currently working on a method combining a variational autoencoder (VAE) with a classifier.
Regarding the classification granularity, the CNN currently has an input view of 224x224 pixels; this means the approach can potentially be applied to any input of that size. As document images are larger, the average score over many crops is used to determine a single type group per page, but the classification results for the individual crops could be used as well. Saskia has noticed that some "misclassified" crops were actually correct, as some pages have a small amount of text in another font (I do not take this case into account yet). However, I expect that there is a lower bound on the amount of text needed for a correct type group identification.
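For illustration, the crop-averaging step described above might look like the following sketch (type-group names are taken from this thread, but the score numbers are invented):

```python
# Combine per-crop score vectors into one page-level decision.
TYPES = ["Antiqua", "Bastarda", "Fraktur", "Textura"]

def page_type(crop_scores):
    """crop_scores: list of per-crop score vectors (one float per type).
    Returns the winning type-group name and the averaged score vector."""
    n = len(crop_scores)
    avg = [sum(s[i] for s in crop_scores) / n for i in range(len(TYPES))]
    winner = TYPES[max(range(len(TYPES)), key=lambda i: avg[i])]
    return winner, avg

crops = [[0.7, 0.1, 0.1, 0.1],   # crop 1: confident Antiqua
         [0.5, 0.2, 0.2, 0.1],   # crop 2: leaning Antiqua
         [0.1, 0.1, 0.7, 0.1]]   # crop 3: "misclassified" as Fraktur
name, avg = page_type(crops)     # Antiqua wins on average
```

Keeping the per-crop vectors around (rather than only `name`) is what allows the "small amount of text in another font" case to be detected later.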
I agree that it would be best not to store only the classification result, but rather a score for each type group. I dislike throwing away information. It is unfortunate that there is no way in PAGE (and, to my knowledge, in all other frequently used formats) to store several possibilities with their respective confidences. In any case, I can provide scores for all type groups known to the classifier instead of a single one. If we store the data as Konstantin suggests (`type:conf type:conf ...`), and if the values are sorted by confidence, then a string split on `:` can easily get the value with the highest confidence.
@wrznr @stweil I will have a look at this.
> I agree that it would be best not to store only the classification result, but rather a score for each type group. I dislike throwing away information. It is unfortunate that there is no way in PAGE (and to my knowledge in all other frequently used formats) to store several possibilities with their respective confidence.
Hi @seuretm, ALTO currently includes support for recognition variants with their respective confidence values, on either the word level using `<ALTERNATIVEType>` or the character level using `<VariantType>`. See e.g. https://github.com/altoxml/documentation/blob/master/v4/GlyphSamples/Glyph_Sample02_AlternativeClarification.xml.
@tboenig is currently working on some PAGE XML examples including font information. They will be added to the assets repo.
In the ground truth data, the font information (type, cut, ...) is documented in two places in PAGE XML.
<TextLine custom="textStyle {fontFamily:Arial; fontSize:17.0; bold:true;}">
The keyword for this information is `textStyle`. For the font: `fontFamily`, for the size: `fontSize`, and for the typographic style the characteristic feature. See: http://www.ocr-d.de/sites/all/gt_guidelines/lyTypographie.html

This information is primarily recorded in the `TextStyle` element:
<TextStyle fontFamily="Arial" fontSize="17.0" bold="true"/>
However, since not all typographic information can be stored in `TextStyle`, this is problematic in cases such as:
<TextRegion custom="textStyle {fontFamily:Arial:Times:Courier; }">
<TextStyle fontFamily="Arial:Times:Courier"/>
<TextLine custom="textStyle {fontFamily:Arial:Times; }">
<TextStyle fontFamily="Arial:Times"/>
<Word custom="textStyle {fontFamily:Arial; }">
<TextStyle fontFamily="Arial"/>
The attribute fontFamily must also be used for the documentation of font clusters.
<TextRegion custom="textStyle {bold:true;}">
<TextStyle bold="true"/>
(only for the whole TextRegion)

<TextLine custom="textStyle {bold:true;}">
<TextStyle bold="true"/>
(only for the whole TextLine)

<Word custom="textStyle {bold:true;}">
<TextStyle bold="true"/>
(only for the whole Word)

The documentation of these features is also stored in the METS file. In this case, the information is extracted from the PAGE file and documented in the `dmdSec`:
<dmdSec ID="dmd001">
  <mdWrap MIMETYPE="text/xml" MDTYPE="PAGEXML" LABEL="PAGE XML">
    <xmlData>
      <page:TextRegion id="r_1_1" custom="textStyle {fontFamily:Arial:Times:Courier; }">
      <page:TextStyle id="re_1_1" fontFamily="Arial:Times:Courier"/>
      <page:TextLine id="l_1_1" custom="textStyle {fontFamily:Arial:Times; }">
      <page:TextStyle id="li_1_1" fontFamily="Arial:Times"/>
      <page:Word id="w_1_1" custom="textStyle {fontFamily:Arial; }">
      <page:TextStyle id="wo_1_1" fontFamily="Arial"/>
    </xmlData>
  </mdWrap>
</dmdSec>
@tboenig @maria-federbusch Great work, many thanks. Pls. realize as PR.
Dear all,
The specifications for font information and typographic features are documented in ocr-d/spec (see https://github.com/OCR-D/spec). I would like to thank everyone who participated in the discussion.
merged with #96
What kind of font information do we need to encode?
...
Do we encode this page-wise (in METS) or in PAGE?
How do we want to encode it? Does PAGE have all the necessary attributes, so we do not have to rely on `custom`?

@VChristlein @bertsky @finkf @seuretm