Closed eroux closed 1 year ago
so that means I have to write a regex to detect which script the text belongs to and put that as the Language in bbox, rather than parsing the page language info from the hocr or parsing the paragraph language in IA-format hocr, right?
Correct, yes, although you just need to detect the script, not the language.
The latest Unicode data is at http://www.unicode.org/Public/UNIDATA/Scripts.txt . Oddly enough, I can't find a straightforward library in Python that would give the result directly... Note that the "Common" script should be mapped to the default language.
something like that should work to parse it: https://github.com/googlefonts/nototools/blob/main/nototools/unicode_data.py#L470
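For illustration, here is a minimal sketch (function and variable names are my own, not from nototools) of parsing the Scripts.txt format linked above, where each data line looks like `0041..005A ; Latin # ...`:

```python
import re

# Matches a codepoint or codepoint range followed by "; ScriptName",
# e.g. "0041..005A    ; Latin" or "0F00          ; Tibetan".
_LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(lines):
    """Yield (start, end, script) tuples from Scripts.txt-style lines."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        m = _LINE_RE.match(line)
        if not m:
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else start
        yield (start, end, m.group(3))

# Example with a few lines in the Scripts.txt format:
sample = [
    "# Scripts-15.0.0.txt",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..Z",
    "0F00          ; Tibetan # Lo       TIBETAN SYLLABLE OM",
]
ranges = list(parse_scripts(sample))
# -> [(0x41, 0x5A, 'Latin'), (0xF00, 0xF00, 'Tibetan')]
```

A binary search over the sorted ranges then gives the script for any codepoint.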
you can do this in the main OCR class since it's probably also something we want to do for Google Vision import
okay got it, thanks
A function that does exactly this in a very efficient way is fontTools.unicodedata.script(), let's use it
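Usage is straightforward: `fontTools.unicodedata.script()` takes a single character and returns its ISO 15924 script code. A small illustration (note that "Zyyy" is the "Common" script mentioned above, which would be mapped to the default language):

```python
from fontTools import unicodedata

# script() returns the ISO 15924 four-letter code of a character's
# Unicode script, e.g. "Latn", "Tibt", "Hani".
print(unicodedata.script("a"))    # Latin letter
print(unicodedata.script("\u0f40"))  # Tibetan letter KA
print(unicodedata.script(" "))    # space -> "Zyyy" (Common)
```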
In order to fill the language layer in the opfs we import from OCR, we are currently relying on the language tag returned by the OCR service. While this works reasonably well for Google Vision, the hocr files of Google Books only have that information at the page level, which is an issue when we have multiple languages on a page (a title page of a modern print with a Tibetan and a Chinese title, for instance).
What we should do instead is have a script detection system. In a first iteration, we could only detect: