Open jeanm opened 5 years ago
Here's the really ugly hack I used to make Calamari recognise bold and italic for the Latin alphabet. It requires Vim, but you can certainly adapt it to other editors.
#!/usr/bin/env python3
import sys

# Map each regular Latin capital to its Unicode "Mathematical Bold" counterpart.
regular_chars = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z".split()
bold_chars = "𝐀 𝐁 𝐂 𝐃 𝐄 𝐅 𝐆 𝐇 𝐈 𝐉 𝐊 𝐋 𝐌 𝐍 𝐎 𝐏 𝐐 𝐑 𝐒 𝐓 𝐔 𝐕 𝐖 𝐗 𝐘 𝐙".split()
bold = dict(zip(regular_chars, bold_chars))

# Replace every mapped character from stdin, pass everything else through unchanged.
for c in sys.stdin.read():
    print(bold.get(c, c), end="")
Then edit ~/.vimrc to add a shortcut to call the above script:
vnoremap b s<c-r>=system('/home/jeanm/filter.py', @")<cr><esc>
This will replace the selected text (in visual mode) with the output of the script when you press b. You can do the same for italic, and to un-bold and un-italicise.
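For the un-bold direction, a minimal sketch along the same lines (unbold.py and the B mapping are only illustrative names, not something from the repo):

#!/usr/bin/env python3
# unbold.py (hypothetical name): maps Unicode "Mathematical Bold" capitals
# back to plain A-Z and passes every other character through unchanged.
import sys

regular_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
bold_chars = "".join(chr(ord("𝐀") + i) for i in range(26))
unbold = str.maketrans(bold_chars, regular_chars)

print(sys.stdin.read().translate(unbold), end="")

The matching mapping would then be something like vnoremap B s<c-r>=system('/home/jeanm/unbold.py', @")<cr><esc>.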
This is obviously less than ideal as a solution, but it worked for me. Beyond the extra work required, one reason this isn't as good as native support is that related characters (regular A, bold A, italic A) would probably benefit from sharing the same underlying weights. I've found that with this hack accuracy went down compared to not distinguishing regular/bold/italic, and I needed to add extra training data.
We experimented with something similar when working on a historical lexicon: https://zenodo.org/record/1451482#.XMReFegzY2w In this case study we decided to treat the task as two separate sequence classification problems: textual OCR and typography tagging. The respective models were trained and applied separately, and the results were combined during a postprocessing step using the positional information from Calamari's extended prediction data output. As of now I think that this is the best way to do it. Of course, the computational effort increases, but the codecs stay minimal and each model can focus on its specific subtask. I would love a generic implementation of this, but @ChWick is a little busy (i.e. lazy) right now :-).
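For anyone curious what that postprocessing step can look like, here is a minimal sketch of the alignment idea; the (glyph, x_start, x_end) triples are invented for illustration and are not Calamari's actual extended prediction format. Each OCR character simply takes the typography label whose horizontal span overlaps it the most.

def overlap(a, b):
    # Length of the horizontal intersection of two (start, end) spans.
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_typography(text_chars, typo_chars):
    # For every OCR character pick the typography label with the largest
    # horizontal overlap; fall back to "regular" if nothing overlaps.
    result = []
    for glyph, start, end in text_chars:
        best = max(typo_chars,
                   key=lambda t: overlap((start, end), (t[1], t[2])),
                   default=None)
        if best and overlap((start, end), (best[1], best[2])) > 0:
            result.append((glyph, best[0]))
        else:
            result.append((glyph, "regular"))
    return result

# Toy example: two OCR characters, typography model says bold then italic.
print(assign_typography([("a", 0, 10), ("b", 10, 20)],
                        [("b", 0, 9), ("i", 9, 20)]))
# -> [('a', 'b'), ('b', 'i')]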
Hi @chreul, is the code used for that 2018 paper open source, and if so would you be able to share it?
I too would really like to have typographical information prediction, and I would be willing to put in the work needed to integrate that in Calamari as an optional component (behind a command line flag perhaps), assuming @ChWick et al. are OK with it. The OCR-D folks have a spec (https://github.com/OCR-D/spec/pull/96) for adding typographical information to PAGE-XML via TextStyle elements, and I'm creating a dataset annotated like that for testing purposes.
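For reference, a rough sketch of what writing such an annotation could look like with the standard library; the TextStyle attribute names (bold, italic, fontFamily) are the PAGE ones as far as I remember them, and the namespace version string is just an example, so treat this as an assumption rather than the spec.

import xml.etree.ElementTree as ET

# Example namespace only; the actual PAGE version depends on the dataset/spec.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ET.register_namespace("", PAGE_NS)

line = ET.Element(f"{{{PAGE_NS}}}TextLine", id="l0001")
ET.SubElement(line, f"{{{PAGE_NS}}}TextStyle",
              bold="true", italic="false", fontFamily="Antiqua")
text_equiv = ET.SubElement(line, f"{{{PAGE_NS}}}TextEquiv")
ET.SubElement(text_equiv, f"{{{PAGE_NS}}}Unicode").text = "An example line"

print(ET.tostring(line, encoding="unicode"))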
I'm pretty busy at the moment, but if someone familiar with the codebase can give me a couple of pointers on where to integrate this, I'd be willing to take a stab at it 🙂
Hi, to be honest, I don't think that the code from the Sanders paper would be a good place to start with since... well... I wrote it :-). However, Christoph and I later worked on a similar project which should be a much better starting point. The project is available here. There is also a test folder with some example data and the required models. It is worth mentioning that we used a 3-step approach for this project, namely textual OCR and two different layers of typography, which are then combined afterwards. Unfortunately, we didn't rely on PageXML in this project (neither did we in the previous one) so there probably is a lot of work to be done. Thank you for your offer to give it a try and please don't hesitate to ask if you need any further help!
Thanks for the link @chreul! Do you have a link to the paper so that I can get a better understanding of what's going on? I'm not sure I follow what exactly the "typo1" and "typo2" stages are doing. They just seem to be standard calamari models. By looking at their charsets, which are very limited, I'm guessing you're having them predict something like roman-roman-roman-blank-italic-italic-italic-blank etc., character-by-character. Is that right?
EDIT: found these slides which help clarify a lot. Clever approach, I like it!
For the second use case there is no paper available, but it's quite similar, and the slides you found (I didn't even know those were available online :-)) are probably the best starting point anyway. In general, we used the same approach as we did for the Sanders use case but with two typography levels instead of one. (As far as I remember) Typo1 recognizes normal font, small font and superscript, whereas Typo2 focuses on properties like italic, small caps, etc. Another difference compared to the Sanders approach is that the typographical attributes can change within a word, which is why we couldn't apply the voting approach on word level and simply assigned the typography outputs to the textual output using a simple heuristic based on positional information.
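To make the two-level idea concrete, here is a toy illustration with invented label sets; it assumes the three outputs have already been aligned character-by-character (which is exactly what the positional heuristic above takes care of in practice).

# Invented label sets for illustration.
TYPO1 = {"n": "normal", "s": "small", "p": "superscript"}
TYPO2 = {"r": "regular", "i": "italic", "c": "small caps"}

text  = "2nd edition"
typo1 = "npp nnnnnnn"   # "nd" recognised as superscript
typo2 = "rrr iiiiiii"   # "edition" recognised as italic

# Per-character typography attributes fall out of a simple zip.
for ch, t1, t2 in zip(text, typo1, typo2):
    if ch != " ":
        print(ch, TYPO1[t1], TYPO2[t2])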
Just some random thoughts – I like your solution, but I wonder if we couldn't get away with something even simpler. Problems I see with using a second Calamari model for typography: for one, the ground truth has to be a string like bbbbbbbbbb, but it's quite easy to accidentally type more/fewer bs. How about using a pixel classifier on the line images instead?
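That hand-typing problem can be sidestepped by generating the label string rather than writing it; here is a small sketch with an invented (text, label) run format, so the label line can never drift out of sync with the text line.

# Sketch only: expand (text, label) runs into the per-character label string
# a character-level typography model would be trained on.
def expand_runs(runs, default="r"):
    text = "".join(t for t, _ in runs)
    labels = "".join((label or default) * len(t) for t, label in runs)
    return text, labels

text, labels = expand_runs([("This is ", "r"), ("bold", "b"), (" text.", "r")])
print(text)    # This is bold text.
print(labels)  # rrrrrrrrbbbbrrrrrr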
Just random answers: using bbbbbbbb instead of one b has several advantages: a) it is possible to capture typographic changes within a word (we had a project where this was the case quite often), b) it is straightforward to use the pretrained OCR weights. I think @chreul tested this approach.

Using a PC to determine the typography at each "pixel" seems an interesting idea and I would assume good results, however:
It is also possible to "share" some code. I use the positional prediction of the Calamari typo models to obtain a pixel-wise labeling (similar to your PC approach!) to solve the alignment.
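A sketch of that pixel-wise labeling step, with an invented (label, x_start, x_end) input format: every pixel column of the line image gets the label of the predicted character covering it.

def pixelwise_labels(char_spans, line_width, background="-"):
    # One label per pixel column; columns not covered by any character keep
    # the background label.
    columns = [background] * line_width
    for label, x_start, x_end in char_spans:
        for x in range(max(0, x_start), min(line_width, x_end)):
            columns[x] = label
    return "".join(columns)

# Two bold characters followed by one italic character on a 12 px wide line.
print(pixelwise_labels([("b", 0, 3), ("b", 3, 7), ("i", 8, 11)], 12))
# -> bbbbbbb-iii-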
Congrats on the fantastic tool. Have you given any thought to making the network aware of italic, bold, etc., as well as different types of typefaces? As far as I can tell this should (hopefully) be a relatively small change.
Here's how I imagine it could be implemented:
main, alternate, alternate_italic.