OpenPecha / Toolkit

🛠 Tools to create, edit and export texts and annotations
https://toolkit.openpecha.org
Apache License 2.0

language / script detection #191

Closed: eroux closed this issue 1 year ago

eroux commented 1 year ago

To fill the language layer in the OPFs we import from OCR, we currently rely on the language tag returned by the OCR service. While this works reasonably well for Google Vision, the hOCR files of Google Books only carry that information at the page level, which is an issue when a page contains multiple languages (the title page of a modern print, with a Tibetan and a Chinese title, for instance).

What we should do instead is have a script detection system. In a first iteration, we could only detect:

ta4tsering commented 1 year ago

So that means I have to write a regex to detect which language the text belongs to and put that as the Language in the bbox, rather than parsing the page language info from the hOCR or parsing the paragraph language in the IA-format hOCR, right?

eroux commented 1 year ago

Correct, although you only need to detect the script, not the language.

eroux commented 1 year ago

The latest Unicode data is at http://www.unicode.org/Public/UNIDATA/Scripts.txt. Oddly enough, I can't find a straightforward Python library that would give the result directly. Note that the "Common" script should be mapped to the default language.

eroux commented 1 year ago

Something like this should work to parse it: https://github.com/googlefonts/nototools/blob/main/nototools/unicode_data.py#L470

you can do this in the main OCR class since it's probably also something we want to do for Google Vision import
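
For illustration, parsing the Scripts.txt layout and looking up a character could be sketched roughly as below. This is not the nototools code and not the toolkit's implementation: `SAMPLE`, `parse_scripts` and `script_name` are hypothetical names, and the embedded excerpt only stands in for the real file, which would need to be downloaded separately.

```python
import bisect
import re

# Illustrative excerpt in the Scripts.txt layout (NOT the full file).
SAMPLE = """\
# code point or range ; script name # comment
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0F40..0F47    ; Tibetan # Lo   [8] TIBETAN LETTER KA..TIBETAN LETTER JA
4E00..9FFF    ; Han # Lo [20992] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FFF
"""

# "XXXX ; Script" or "XXXX..YYYY ; Script"; comment and blank lines do not match.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(lines):
    """Return a sorted list of (start, end, script) tuples."""
    ranges = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2) or match.group(1), 16)
        ranges.append((start, end, match.group(3)))
    ranges.sort()
    return ranges


def script_name(char, ranges, default="Unknown"):
    """Binary-search the ranges for the script of a single character."""
    code_point = ord(char)
    starts = [start for start, _, _ in ranges]
    index = bisect.bisect_right(starts, code_point) - 1
    if index >= 0 and ranges[index][1] >= code_point:
        return ranges[index][2]
    return default


ranges = parse_scripts(SAMPLE.splitlines())
print(script_name("\u0f40", ranges))  # a Tibetan letter
print(script_name(" ", ranges))       # "Common": caller maps it to the default language
```

In a real import, the `starts` list would be built once rather than on every lookup, and a result of "Common" would be replaced by the default language of the text, as noted above.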

ta4tsering commented 1 year ago

okay got it, thanks

eroux commented 1 year ago

A function that does exactly this, and very efficiently, is fontTools.unicodedata.script(). Let's use it.

see https://github.com/fonttools/unicodedata2/issues/57
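
A minimal sketch of how fontTools.unicodedata.script() might be used to split a line into script runs, assuming the fonttools package is installed. `script_runs`, `NEUTRAL` and the `script_of` parameter are hypothetical names for this example, not part of the toolkit. fontTools returns four-letter ISO 15924 codes such as "Tibt", "Hani" or "Latn", with "Zyyy" for Common and "Zinh" for Inherited.

```python
# Codes that do not identify a script on their own:
# Common, Inherited, Unknown.
NEUTRAL = {"Zyyy", "Zinh", "Zzzz"}


def script_runs(text, script_of=None, default="Zyyy"):
    """Split text into (script_code, substring) runs.

    Neutral characters (spaces, digits, punctuation) are attached to the
    run they follow; a run made only of neutral characters gets `default`.
    `script_of` is a per-character lookup, assumed to behave like
    fontTools.unicodedata.script, which is used when none is given.
    """
    if script_of is None:
        from fontTools.unicodedata import script as script_of

    runs = []  # each item is a mutable [script_code_or_None, substring]
    for char in text:
        code = script_of(char)
        if code in NEUTRAL:
            if runs:
                runs[-1][1] += char
            else:
                runs.append([None, char])
        elif runs and runs[-1][0] in (code, None):
            # Same script, or a pending neutral-only run that now gets a script.
            runs[-1][0] = code
            runs[-1][1] += char
        else:
            runs.append([code, char])
    return [(code or default, chunk) for code, chunk in runs]


if __name__ == "__main__":
    # A title line mixing Tibetan and Chinese.
    for code, chunk in script_runs("བོད་ 西藏"):
        print(code, repr(chunk))
```

Each run's code can then be mapped to a language tag for the language layer, with the neutral-only case falling back to the default language.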