OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

Modelling font information #76

Closed: kba closed this issue 5 years ago

kba commented 6 years ago

What kind of font information do we need to encode?

...

Do we encode this page-wise (in METS) or in PAGE?

How do we want to encode it? Does PAGE have all the necessary attributes, so we do not have to rely on @custom?

@VChristlein @bertsky @finkf @seuretm

bertsky commented 6 years ago

IINM, there might be multiple use cases with different goals:

Since local features will be most useful anyway, and since everything (including global features) can be accommodated by PAGE, I suggest keeping METS completely out of this. PAGE's TextStyle element offers, among others:

All of which would be most useful if resolved up to the Word level. Is that even possible with reasonable accuracy? Then there is of course @primaryScript and @secondaryScript (e.g. Latn vs Latf), which probably makes more sense on the TextRegion or Page level.

Font family might have to be more fine-grained and more locally resolved for OCR than for other use cases. (Certainly something between blackletter vs antiqua and the font equivalence classes of the X font system, but possibly also including character set information like "including ſ" or "excluding ß".) So maybe there is more to this than sharply discernible clusters: font classification based on OCR error rate? @stweil @noahmetzger

bertsky commented 6 years ago

Local resolution (i.e. how sharp the boundaries of those font features can become in the text sequence) is probably a two-sided problem: it is difficult to get high precision at high resolution, and it is also difficult to annotate at high resolution without impairing OCR accuracy (because recognition might want to decide on Word and Glyph coordinates on its own – see #77 – or even not enforce word segmentation at all, see #72).

And maybe formatting features should be allowed to carry confidence values, too?

stweil commented 6 years ago

@kba, if you want to describe how text is transcribed in the OCR results, then "uses long s (ſ)" is not sufficient. Tesseract, for example, also outputs special UTF-8 characters for some ligatures like æ, œ, ﬀ, ﬅ and more.

bertsky commented 6 years ago

@stweil, I think this is more about prescribing allowed character sets. If we know from metadata or from the font classifier component that the whole text can or cannot contain ſ, ß etc., then we would like OCR to know that as well. Moreover, IIRC the spec says we must not produce consonant ligatures like ﬀ or ﬅ (GT level 2).

Maybe for Tesseract all this is doable with tessedit_char_blacklist and tessedit_char_whitelist at runtime?
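For instance, a minimal sketch with tesserocr (assuming a Fraktur model `frk` is installed; the image path and the blacklisted characters are just placeholders):

```python
from tesserocr import PyTessBaseAPI

# Hypothetical constraint from metadata / font classification:
# the text never contains ß, so exclude it from recognition.
with PyTessBaseAPI(lang='frk') as api:
    api.SetVariable('tessedit_char_blacklist', 'ß')
    # or, the other way round, restrict recognition to an explicit whitelist:
    # api.SetVariable('tessedit_char_whitelist', 'abcdefghijklmnopqrstuvwxyzſ')
    api.SetImageFile('page_0001.png')
    print(api.GetUTF8Text())
```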

kba commented 6 years ago

@stweil True, these are just examples to start a discussion. At this point, I'm most interested in what kinds of font information are needed and how we model them, so we can produce matching ground truth data ASAP.

@bertsky Thanks for the feedback.

@VChristlein @seuretm You mentioned in the call that you've been identifying some 8 clusters of types. Can you explain the features that they share?

Off the top of my head, I am currently thinking of misusing @fontFamily to contain space-separated (think HTML @class) font names and font feature sets from a controlled vocabulary, possibly with a suffixed confidence value (like fontFamily="blackletter schwabacher:0.7 textualis:0.3"), and possibly a convention to use the boolean attributes like bold and italic only if the confidence is 1, falling back to further misusing fontFamily to include them there with confidence values.
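To make that concrete, here is a small illustrative sketch (not part of any spec, just my reading of the proposal) that serializes classifier scores into such a @fontFamily value:

```python
def serialize_font_family(scores, threshold=0.05):
    """Turn classifier scores like {'blackletter': 1.0, 'schwabacher': 0.7, 'textualis': 0.3}
    into the proposed space-separated @fontFamily value. A score of 1.0 is written as a
    bare name, everything else gets a ':<conf>' suffix; items below threshold are dropped."""
    items = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    parts = [name if conf >= 1.0 else f'{name}:{conf}'
             for name, conf in items if conf >= threshold]
    return ' '.join(parts)

# serialize_font_family({'blackletter': 1.0, 'schwabacher': 0.7, 'textualis': 0.3})
# -> 'blackletter schwabacher:0.7 textualis:0.3'
```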

wrznr commented 6 years ago

@VChristlein @seuretm We need your input here. Would you prefer to discuss this online (i.e. VC) or could you leave your thoughts here?

wrznr commented 6 years ago

@noahmetzger Could you pls. check a) whether tessedit_char_blacklist and tessedit_char_whitelist are working with lstm models and b) if they are exposed via tesserocr? Thanks!

VChristlein commented 6 years ago

Hi, @seuretm is currently working on a font group recognition system. From an accuracy point of view, the more text we have, the more accurate it becomes (as long as there are not multiple fonts). As far as I know, @seuretm currently works on a per-patch basis; while the network is not yet fully convolutional, it could easily be converted, and then it could be used for bigger resolutions (currently many random patches are just averaged), i.e. full images/paragraphs. Lines should work too (if padded accordingly), but I guess words are too small to be recognized accurately enough. Thus, it would make sense if the information could be put anywhere in PAGE, but for a start I'd only consider full pages (or?). Mathias also wanted to start the integration into OCR-D now and might need some help here; I'll tell him to follow this discussion.

stweil commented 6 years ago

@wrznr, I'm afraid that black list / white list currently require a legacy model, see Tesseract wiki. Perhaps @noahmetzger can have a look whether this feature can be added for LSTM models, too.
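If so, a variant of the sketch above, explicitly selecting the legacy engine (assuming the installed traineddata still ships the legacy model) so that the whitelist/blacklist variables take effect:

```python
from tesserocr import PyTessBaseAPI, OEM

# Select the legacy engine instead of the (default) LSTM engine,
# since tessedit_char_blacklist / tessedit_char_whitelist currently
# require a legacy model.
with PyTessBaseAPI(lang='frk', oem=OEM.TESSERACT_ONLY) as api:
    api.SetVariable('tessedit_char_blacklist', 'ß')
    api.SetImageFile('page_0001.png')
    text = api.GetUTF8Text()
```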

bertsky commented 6 years ago

Regarding blacklist and whitelist, if it is too much of an effort, we could also delegate this to our API wrapper ocrd_tesserocr, which would then have two options:

  1. loading a specific language model known to adhere to the given (character set) constraints implicitly.
  2. using GetChoiceIterator or GetGlyphConfidences to explicitly exclude certain characters. The latter would be a fall-back option if no such models exist / are installed.

Something along those lines could surely be done for other OCRs as well.
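As a rough illustration of option 1 (the constraint-to-model mapping and the model names are entirely made up; tesserocr.get_languages() just lists the installed models):

```python
import tesserocr

# Hypothetical mapping from known character-set constraints to installed
# models that are known to respect them (placeholder names only).
MODELS_BY_CONSTRAINT = {
    frozenset({'includes ſ', 'no consonant ligatures'}): 'frk_gt2',
    frozenset({'includes ſ'}): 'frk',
}

def pick_model(constraints):
    """Return an installed model matching the constraints, or None to fall
    back to option 2 (filtering symbol choices explicitly)."""
    _, installed = tesserocr.get_languages()
    wanted = MODELS_BY_CONSTRAINT.get(frozenset(constraints))
    return wanted if wanted in installed else None
```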

seuretm commented 6 years ago

Hello everybody,

Regarding the number of fonts (or type groups), we are currently using images selected by Saskia which have 8 different types (Antiqua, Bastarda, Fraktur, Greek, Hebrew, Kursiv, Rotunda, and Textura). Training a CNN on the 15th-century pages with some data augmentation allows us to reach a classification accuracy of roughly 95%. However, this approach does not appear to cluster together pages from unseen scripts (the later ones) when visualizing the outputs of the last two layers of the CNN. This means that automatically clustering unseen fonts using this CNN is likely to fail. To solve this issue, I am currently working on a method combining a variational autoencoder (VAE) with a classifier.

Regarding the classification granularity, the CNN currently has an input view of 224x224 pixels; this means that the approach can potentially be applied to inputs of this size. As document images are larger, the average score over many crops is used to determine a single type group per page, but the classification results of the individual crops can be used as well. Saskia has noticed that some "misclassified" crops were actually correct, as some pages have a small amount of text in another font (I do not take this case into account yet). However, I expect that there is a lower bound on the amount of text needed for correct type group identification.
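A minimal numpy sketch of that page-level decision, purely for illustration (the crop extraction and the classifier itself are assumed to exist elsewhere):

```python
import numpy as np

TYPE_GROUPS = ['Antiqua', 'Bastarda', 'Fraktur', 'Greek',
               'Hebrew', 'Kursiv', 'Rotunda', 'Textura']

def classify_page(crop_scores):
    """crop_scores: array of shape (n_crops, 8) with per-crop softmax outputs
    of the CNN on 224x224 crops. Returns the averaged scores and the single
    type group chosen for the page."""
    mean_scores = np.asarray(crop_scores).mean(axis=0)
    return mean_scores, TYPE_GROUPS[int(mean_scores.argmax())]
```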

I agree that it would be best not to store only the classification result, but rather a score for each type group. I dislike throwing away information. It is unfortunate that there is no way in PAGE (and to my knowledge in all other frequently used formats) to store several possibilities with their respective confidence. In any case, I can provide scores for all type groups known by the classifier instead of a single one. If we store the data as Konstantin suggests ("type:conf type:conf ..."), and if the values are sorted by confidence, then a string split on ":" can easily get the value with the highest confidence.
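For completeness, a small parsing sketch for that convention (assuming a bare name means confidence 1.0):

```python
def parse_font_family(value):
    """Parse e.g. 'schwabacher:0.7 textualis:0.3' into (name, confidence)
    pairs; with the values sorted by confidence, the first pair is the
    most likely type group."""
    pairs = []
    for token in value.split():
        name, _, conf = token.partition(':')
        pairs.append((name, float(conf) if conf else 1.0))
    return pairs

best_name, best_conf = parse_font_family('schwabacher:0.7 textualis:0.3')[0]
```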

noahmetzger commented 6 years ago

@wrznr @stweil I will have a look at this.

cneud commented 6 years ago

> I agree that it would be best not to store only the classification result, but rather a score for each type group. I dislike throwing away information. It is unfortunate that there is no way in PAGE (and to my knowledge in all other frequently used formats) to store several possibilities with their respective confidence.

Hi @seuretm, ALTO currently includes support for recognition variants with their respective confidence values on either word-level using <ALTERNATIVEType> or character-level using <VariantType>. See e.g. https://github.com/altoxml/documentation/blob/master/v4/GlyphSamples/Glyph_Sample02_AlternativeClarification.xml.

wrznr commented 6 years ago

@tboenig is currently working on some PAGE XML examples including font information. They will be added to the assets repo.

tboenig commented 6 years ago

In the ground truth data, the font information (typeface, cut, ...) is documented in two places in PAGE XML:

  1. As a custom value on the elements TextRegion, TextLine and Word; here an example for a TextLine: <TextLine custom="textStyle {fontFamily:Arial; fontSize:17.0; bold:true;}">. The keyword for this information is textStyle; the typeface is given as fontFamily, the size as fontSize, and the typographic style as the corresponding feature attribute. See: http://www.ocr-d.de/sites/all/gt_guidelines/lyTypographie.html

  2. In the TextStyle element, where all of this information is primarily recorded as attributes: <TextStyle fontFamily="Arial" fontSize="17.0" bold="true"/>. See: http://www.ocr-d.de/sites/all/gt_guidelines/pagecontent_xsd_Complex_Type_pc_TextStyleType.html?hl=textstyle

However, not all typographic information can be stored directly in TextStyle. This is the case for the following problems:

  1. Different fonts within one paragraph region. Solution:
    • <TextRegion custom="textStyle {fontFamily:Arial:Times:Courier; }">
    • <TextStyle fontFamily="Arial:Times:Courier"/>
    • <TextLine custom="textStyle {fontFamily:Arial:Times; }">
    • <TextStyle fontFamily="Arial:Times"/>
    • <Word custom="textStyle {fontFamily:Arial; }">
    • <TextStyle fontFamily="Arial"/>

The attribute fontFamily must also be used for the documentation of font clusters.

  2. Different typographic styles within one element. Solution:
    • <TextRegion custom="textStyle {bold:true;}">
    • <TextStyle bold="true"/> only for the whole TextRegion
    • <TextLine custom="textStyle {bold:true;}">
    • <TextStyle bold="true"/> only for the whole TextLine
    • <Word custom="textStyle {bold:true;}">
    • <TextStyle bold="true"/> only for the whole Word

The documentation of these features is also stored in the METS file. In this case, the information is extracted from the PAGE file and documented in the dmdSec area:

<dmdSec ID="dmd001">
 <mdWrap MIMETYPE="text/xml" MDTYPE="OTHER" OTHERMDTYPE="PAGEXML" LABEL="PAGE XML">
  <xmlData>
   <page:TextRegion id="r_1_1" custom="textStyle {fontFamily:Arial:Times:Courier; }">
      <page:TextStyle id="re_1_1" fontFamily="Arial:Times:Courier"/>
      <page:TextLine id="l_1_1" custom="textStyle {fontFamily:Arial:Times; }">
         <page:TextStyle id="li_1_1" fontFamily="Arial:Times"/>
         <page:Word id="w_1_1" custom="textStyle {fontFamily:Arial; }">
            <page:TextStyle id="wo_1_1" fontFamily="Arial"/>
         </page:Word>
      </page:TextLine>
   </page:TextRegion>
  </xmlData>
 </mdWrap>
</dmdSec>
wrznr commented 6 years ago

@tboenig @maria-federbusch Great work, many thanks. Pls. realize as PR.

tboenig commented 6 years ago

Dear all,

The specifications for font information and typographic features are documented in ocr-d/spec (see https://github.com/OCR-D/spec). I would like to thank everyone who participated in the discussion.

kba commented 6 years ago

Pull requests: https://github.com/OCR-D/spec/pull/95 / https://github.com/OCR-D/spec/pull/96

kba commented 5 years ago

merged with #96