Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
GNU General Public License v3.0

Detect typeface styles #84

Open · jeanm opened this issue 5 years ago

jeanm commented 5 years ago

Congrats on the fantastic tool. Have you given any thought to making the network aware of italic, bold, etc., as well as different types of typefaces? As far as I can tell this should (hopefully) be a relatively small change.

Here's how I imagine it could be implemented:

  1. Treat all "stylistic" info (the specific font, whether it's bold, italic, etc.) as an extra closed-class classification problem. The person doing the training is responsible for providing info on which kind of stylistic labels are present in the training data. E.g. if the training data has two different typefaces, a main font and an alternate font, and the alternate can optionally be italic, then the new stylistic classifier will have the following classes: main, alternate, alternate_italic.
  2. The training data is somehow annotated for stylistic info. This is the slightly more annoying bit to implement, I imagine. One could use some kind of XML markup to denote segments of characters that are in a font different from the main one (see the sketch after this list), e.g.
    This is the main font, then we have <alternate>some text in
    the alternate font</alternate> and finally
    <alternate_italic>the alternate font in italic</alternate_italic>
  3. In the forward pass of the network, the old character classifier is kept, and the new stylistic classifier is additionally run to predict the correct font.
  4. ???
  5. Profit!
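A minimal sketch of how the markup proposed in step 2 could be turned into per-character style labels. The tag names and the regex-based parsing are illustrative assumptions, not an existing Calamari format:

    import re

    # Anything wrapped in <style>...</style> gets that style; everything else is "main".
    TAG_RE = re.compile(r"<(?P<style>\w+)>(?P<text>.*?)</(?P=style)>", re.DOTALL)

    def char_styles(annotated_line):
        """Turn markup like '<alternate>abc</alternate>' into (char, style) pairs."""
        labels, pos = [], 0
        for m in TAG_RE.finditer(annotated_line):
            labels += [(c, "main") for c in annotated_line[pos:m.start()]]
            labels += [(c, m.group("style")) for c in m.group("text")]
            pos = m.end()
        labels += [(c, "main") for c in annotated_line[pos:]]
        return labels

    print(char_styles("main text, then <alternate_italic>italic text</alternate_italic>"))
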
jeanm commented 5 years ago

Here's the really ugly hack I used to make Calamari recognise bold and italic for the Latin alphabet. It requires Vim, but you can certainly adapt it to other editors.

  1. Take a font editor (I used BirdFont), and open a copy of your favourite font.
  2. Pick a range of characters you're definitely not going to need for your documents and that aren't affected by normalisation (I picked Cyrillic).
  3. Replace them with a copy of the Latin alphabet (or whatever characters you need), but alter the letters to be bold (that's easy to do with BirdFont by just increasing the stroke width). Also add another copy in italic (you can automatically slant characters in BirdFont). Then export the font and install it system-wide.
  4. Now you can annotate your ground truth files using your "fake" bold and italic (which are really Cyrillic characters). You can use the script trick described in the next steps to annotate things quickly. However...
  5. ...since manually writing Cyrillic (and remembering the mapping by heart) is going to be too complex, you can make a script to automatically map your regular characters to their fake bold versions, and vice versa. As a very simple example:
    #!/usr/bin/env python3
    import sys

    # Map each Latin letter to the Cyrillic codepoint standing in for its "bold" form.
    regular_chars = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z".split()
    bold_chars = "Ѐ Ё Ђ Ѓ Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ѝ Ў Џ А Б В Г Д Е Ж З И Й".split()
    bold = dict(zip(regular_chars, bold_chars))

    # Read stdin and print the same text with Latin letters replaced by their fake-bold counterparts.
    for c in sys.stdin.read():
        print(bold.get(c, c), end="")

    Then...

  6. ...you can modify your ~/.vimrc to add a shortcut to call the above script:
    vnoremap b s<c-r>=system('/home/jeanm/filter.py', @")<cr><esc>

    This will replace the selected contents (in visual mode) with the output of the script when pressing b. You can do the same for italic, and to un-bold and un-italicise.

  7. Finally, set your terminal to display characters in your custom font so that "fake bold" is actually shown as bold, etc.
  8. Once you have your OCR output, you'll then want to do some post-processing to remove the fake bold/italic hack.
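For step 8, a small post-processing sketch that inverts the fake-Cyrillic mapping and wraps the recovered runs in markup; the mapping table is the one from the script above, and the <bold> tags are just an illustrative output format:

    import sys

    regular_chars = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z".split()
    bold_chars = "Ѐ Ё Ђ Ѓ Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ѝ Ў Џ А Б В Г Д Е Ж З И Й".split()
    unbold = dict(zip(bold_chars, regular_chars))   # inverse of the mapping above

    out, in_bold = [], False
    for c in sys.stdin.read():
        if c in unbold and not in_bold:     # entering a fake-bold run
            out.append("<bold>")
            in_bold = True
        elif c not in unbold and in_bold:   # leaving a fake-bold run
            out.append("</bold>")
            in_bold = False
        out.append(unbold.get(c, c))
    if in_bold:
        out.append("</bold>")
    print("".join(out), end="")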

This is obviously less than ideal as a solution, but it worked for me. One reason why this isn't as good as having native support, other than the extra work required, is that it would probably be beneficial if related characters (regular A, bold A, italic A) shared the same underlying weights. I've found that with this hack accuracy went down when compared to not distinguishing regular/bold/italic, and I needed to add extra training data.

chreul commented 5 years ago

We experimented with something similar when working on a historical lexicon (https://zenodo.org/record/1451482#.XMReFegzY2w). In this case study we decided to treat the task as two separate sequence classification problems: textual OCR and typography tagging. The respective models were trained and applied separately, and the results were combined during a postprocessing step using the positional information from Calamari's extended prediction data output. As of now I think that this is the best way to do it. Of course, the computational effort increases, but the codecs stay minimal and each model can focus on its specific subtask. I would love a generic implementation of this but @ChWick is a little busy (i.e. lazy) right now :-).
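A minimal sketch of such a positional combination. It assumes each model yields per-character predictions as (label, x_start, x_end) tuples; the exact structure of Calamari's extended prediction data differs and is only approximated here:

    def assign_styles(ocr_chars, typo_chars):
        """For each OCR character, pick the typography label whose horizontal
        extent overlaps it the most. Both inputs: lists of (label, start, end)."""
        result = []
        for ch, c_start, c_end in ocr_chars:
            best_label, best_overlap = None, 0
            for style, t_start, t_end in typo_chars:
                overlap = min(c_end, t_end) - max(c_start, t_start)
                if overlap > best_overlap:
                    best_label, best_overlap = style, overlap
            result.append((ch, best_label))
        return result

    # assign_styles([("a", 10, 18), ("b", 19, 27)], [("italic", 8, 30)])
    # -> [("a", "italic"), ("b", "italic")]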

Baciccin commented 4 years ago

Hi @chreul, is the code used for that 2018 paper open source, and if so would you be able to share it?

I too would really like to have typographical information prediction, and I would be willing to put in the work needed to integrate that in Calamari as an optional component (behind a command line flag perhaps), assuming @ChWick et al. are OK with it. The OCR-D folks have a spec (https://github.com/OCR-D/spec/pull/96) for adding typographical information to PAGE-XML via TextStyle elements, and I'm creating a dataset annotated like that for testing purposes.

I'm pretty busy at the moment, but if someone familiar with the codebase can give me a couple of pointers on where to integrate this, I'd be willing to take a stab at it 🙂

chreul commented 4 years ago

Hi, to be honest, I don't think that the code from the Sanders paper would be a good place to start with since... well... I wrote it :-). However, Christoph and I later worked on a similar project which should be a much better starting point. The project is available here. There is also a test folder with some example data and the required models. It is worth mentioning that we used a 3-step approach for this project, namely textual OCR and two different layers of typography, which are then combined afterwards. Unfortunately, we didn't rely on PAGE-XML in this project (neither did we in the previous one), so there probably is a lot of work to be done. Thank you for your offer to give it a try and please don't hesitate to ask if you need any further help!

Baciccin commented 4 years ago

Thanks for the link @chreul! Do you have a link to the paper so that I can get a better understanding of what's going on? I'm not sure I follow what exactly the "typo1" and "typo2" stages are doing. They just seem to be standard calamari models. By looking at their charsets, which are very limited, I'm guessing you're having them predict something like roman-roman-roman-blank-italic-italic-italic-blank etc., character-by-character. Is that right?

EDIT: found these slides which help clarify a lot. Clever approach, I like it!

chreul commented 4 years ago

For the second use case there is no paper available, but it's quite similar, and the slides you found (I didn't even know those were available online :-)) are probably the best starting point anyway. In general, we used the same approach as we did for the Sanders use case but with two typography levels instead of one. (As far as I remember) Typo1 recognizes normal font, small font and superscript, whereas Typo2 focuses on properties like italic, small caps, etc. Another difference compared to the Sanders approach is that the typographical attributes can change within a word, which is why we couldn't apply the voting approach on the word level and instead simply assigned the typography outputs to the textual output using a simple heuristic based on positional information.

Baciccin commented 4 years ago

Just some random thoughts – I like your solution, but I wonder if we couldn't get away with something even simpler. Problems I see with using a second calamari model for typography:

  1. The model will have to learn to do word and character alignment from scratch, which feels unnecessary since the OCR model is already doing that.
  2. Having to create a second set of training data might be error prone – if you have a 10-character bold word, you end up having to annotate bbbbbbbbbb, but it's quite easy to accidentally type more/fewer bs.

How about using a pixel classifier on the line images instead?

  1. Training data is trivially created from PAGE-XML files, assuming segmentation is done on the word/character level.
  2. At inference time, Calamari could load the pixel classifier, call it for every line, then intersect the pixel classifier's predictions with its own character positional information to determine typographical information down to the character level.
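A rough sketch of the intersection in point 2, assuming the pixel classifier returns one style label per pixel column of the line image and the OCR side provides each character's horizontal span; both formats are assumptions for illustration:

    from collections import Counter

    def style_per_char(char_spans, column_styles):
        """char_spans: list of (char, x_start, x_end) from the OCR model.
        column_styles: one style label per pixel column from the pixel classifier.
        Each character gets the majority style over the columns it covers."""
        out = []
        for ch, x0, x1 in char_spans:
            votes = Counter(column_styles[x0:x1])
            out.append((ch, votes.most_common(1)[0][0] if votes else None))
        return out
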
ChWick commented 4 years ago

Just random answers:

  1. You can use the ATR model as "pretrained weights" so you do not have to start from scratch. One alternative is to train both models in parallel, i.e. sharing the conv, pool, and LSTM layers, and adding two FC layers (one for OCR, one for typography) and two loss functions (a rough sketch follows this list). I tested this, but it performed very similarly to (even a bit worse than) having two models. This code is, however, not integrated in Calamari.
  2. Having bbbbbbbb instead of one b has several advantages: a) it is possible to capture typographic changes within a word (we had a project where this was the case quite often), b) it is straightforward to use the pretrained OCR weights. I think @chreul tested this approach.
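A minimal sketch of the shared-backbone alternative from point 1 (shared conv/pool/LSTM, two FC heads, two losses). It is written in PyTorch purely for illustration, is not the code mentioned above, and the layer sizes are made up:

    import torch.nn as nn

    class SharedOcrTypoNet(nn.Module):
        """Shared conv/pool/LSTM backbone with a CTC head for characters
        and a framewise softmax head for typography labels."""
        def __init__(self, n_chars, n_styles, height=48):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.lstm = nn.LSTM(64 * (height // 4), 128,
                                bidirectional=True, batch_first=True)
            self.ocr_head = nn.Linear(256, n_chars + 1)   # +1 for the CTC blank
            self.typo_head = nn.Linear(256, n_styles)

        def forward(self, x):                  # x: (batch, 1, height, width)
            f = self.backbone(x)               # (batch, 64, height/4, width/4)
            b, c, h, w = f.shape
            f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one time step per column
            seq, _ = self.lstm(f)
            return self.ocr_head(seq), self.typo_head(seq)

    # Training would add the two losses, e.g. CTCLoss on the OCR logits plus
    # CrossEntropyLoss on the framewise typography logits.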

Using a pixel classifier (PC) to determine the typography at each pixel seems like an interesting idea and I would assume good results, however:

  1. The word/character level annotation might not be very accurate
  2. This must be fully implemented (a lot of work). The big advantage is that the "alignment" step is omitted, and I like that! So feel free to test this approach! It will work if you have enough time and training data.

It is also possible to "share" some code. I use the positional predictions from Calamari to obtain a pixel-wise labeling (similar to your PC approach!) to solve the alignment.