Closed eroux closed 1 year ago
so that means I have to write a regex to detect which script the text belongs to and put that as the Language in bbox, rather than parsing the page language info from the hocr or parsing the paragraph language in IA-format hocr, right?
Correct, yes, although you just need to detect the script, not the language.
The latest Unicode data is at http://www.unicode.org/Public/UNIDATA/Scripts.txt . Oddly enough, I can't find a straightforward library in Python that would give the result directly... Note that the "Common" script should be mapped to the default language.
something like that should work to parse it: https://github.com/googlefonts/nototools/blob/main/nototools/unicode_data.py#L470
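For illustration, here is a minimal sketch (function and variable names are my own, not from nototools) of parsing the Scripts.txt format linked above, where each data line looks like `0041..005A ; Latin # ...`:

```python
import re

# Matches a codepoint or codepoint range followed by "; ScriptName",
# e.g. "0041..005A    ; Latin" or "0F00          ; Tibetan".
_LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(lines):
    """Yield (start, end, script) tuples from Scripts.txt-style lines."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        m = _LINE_RE.match(line)
        if not m:
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else start
        yield (start, end, m.group(3))

# Example with a few lines in the Scripts.txt format:
sample = [
    "# Scripts-15.0.0.txt",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..Z",
    "0F00          ; Tibetan # Lo       TIBETAN SYLLABLE OM",
]
ranges = list(parse_scripts(sample))
# -> [(0x41, 0x5A, 'Latin'), (0xF00, 0xF00, 'Tibetan')]
```

A binary search over the sorted ranges then gives the script for any codepoint.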
you can do this in the main OCR class since it's probably also something we want to do for Google Vision import
okay got it, thanks
A function that does exactly this in a very efficient way is fontTools.unicodedata.script(), let's use it
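Usage is straightforward: `fontTools.unicodedata.script()` takes a single character and returns its ISO 15924 script code. A small illustration (note that "Zyyy" is the "Common" script mentioned above, which would be mapped to the default language):

```python
from fontTools import unicodedata

# script() returns the ISO 15924 four-letter code of a character's
# Unicode script, e.g. "Latn", "Tibt", "Hani".
print(unicodedata.script("a"))    # Latin letter
print(unicodedata.script("\u0f40"))  # Tibetan letter KA
print(unicodedata.script(" "))    # space -> "Zyyy" (Common)
```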
In order to fill the language layer in the opfs we import from OCR, we are currently relying on the language tag returned by the OCR service. While this works reasonably well for Google Vision, the hocr files of Google Books only have that information at the page level, which is an issue when we have multiple languages on a page (a title page of a modern print with a Tibetan and a Chinese title, for instance).
What we should do instead is have a script detection system. In a first iteration, we could only detect: