OpenPecha / Toolkit

🛠 Tools to create, edit and export texts and annotations
https://toolkit.openpecha.org
Apache License 2.0

forcing the -bo option in Google OCR? #111

Open eroux opened 3 years ago

eroux commented 3 years ago

Here are some remarks from someone who uses OpenPecha on BUDA and also uses Google OCR directly:

Oh, as I start to use the e-text facility of https://library.bdrc.io/ I notice that occasionally Google OCR has OCR'd Tibetan as Devanāgari! I wonder if the -bo option was used during OCR and Google overrode it, or if it was left in multilingual mode. I have not seen this behavior in my own use of the -bo option.

Do we force the language to Tibetan when we run the OCR?

eroux commented 3 years ago

An example of failure: in that case the OCR added some Punjabi in the middle of the Tibetan:

W1PD95844-I1PD95910-394-394-any.docx W1PD95844-I1PD95910-394-394-any.pdf

eroux commented 3 years ago

the languageHints field of imageContext seems to be what we want to set to "bo"
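For reference, a minimal sketch of where that field sits in the JSON body of a Vision `images:annotate` request. The helper name `build_annotate_request` is hypothetical and this is not the code the OCR script actually uses; it only builds the request body and does not call the API.

```python
import base64


def build_annotate_request(image_bytes, language_hints=None):
    """Build the JSON body for a Vision `images:annotate` REST call.

    `language_hints` is a list of language codes such as ["bo"]; when it is
    empty or None, no imageContext is sent and detection stays automatic.
    """
    request = {
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
    }
    if language_hints:
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}


# Placeholder bytes stand in for a real page image.
body = build_annotate_request(b"placeholder image bytes", ["bo"])
```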

10zinten commented 3 years ago

Sure. For this, can we assume that all BDRC works are in Tibetan, or is there a BDRC API endpoint that we can query to get the language of a particular work?

eroux commented 3 years ago

the OP_Info query should have it (if not you can default to Tibetan), see for instance http://purl.bdrc.io/query/graph/OP_info?R_RES=bdr:W1PD95844 :

Screenshot from 2021-09-16 08-35-21

in a first iteration, you can handle just bdr:LangBo, bdr:LangEn and bdr:LangZh (and not set the value if there are other languages). If there are multiple values for language (which happens from time to time), you can just not set the value
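The rule above could be sketched roughly as follows. The function name and the assumption that the language values have already been extracted from the OP_info response as a list of strings are mine; how to parse the response is not covered here.

```python
# Only these three values are handled in a first iteration.
SUPPORTED_LANGUAGES = {
    "bdr:LangBo": "bo",
    "bdr:LangEn": "en",
    "bdr:LangZh": "zh",
}


def language_hint_for(work_languages):
    """Return a language code to use as hint, or None to leave it unset.

    - no language recorded: default to Tibetan
    - exactly one supported language: use it
    - one unsupported language, or several languages: do not set a hint
    """
    languages = set(work_languages)
    if not languages:
        return "bo"
    if len(languages) > 1:
        return None
    return SUPPORTED_LANGUAGES.get(next(iter(languages)))
```

With this sketch, a work tagged only `bdr:LangBo` gets `"bo"`, while a work with several language values gets no hint at all.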

10zinten commented 3 years ago

Great! I will push the fix as soon as possible.

eroux commented 3 years ago

for reference (and tests), here's a work with multiple languages: http://purl.bdrc.io/query/graph/OP_info?R_RES=bdr:W1AC406

eroux commented 3 years ago

Just out of curiosity, can you run the OCR on https://iiif.bdrc.io/bdr:I1PD95910::I1PD959100394.jpg/full/max/0/default.jpg with the bo option?

eroux commented 3 years ago

Just to fully document the issue: this seems to be a case where Google Vision is trying to read the other side of the paper:

Screenshot from 2021-09-16 10-39-35

Screenshot from 2021-09-16 10-39-05

so in that specific case, setting the language to Tibetan will not help significantly... @ngawangtrinley perhaps we can do a bit of preprocessing of the contrast on the image?

Also @10zinten we should save the settings used for Google Vision (the language setting but perhaps others too?) in the meta.xml of the pecha

eroux commented 3 years ago

I think this should perhaps be an option that can be set manually when doing the OCR (through a command line option or something similar). In some cases (like blockprints) we really want to use it, but for some modern prints we may want to specify "bo", "zh" when there's a Chinese introduction... but the default could be bo
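A possible shape for such an option, as a sketch only: the `--lang` flag, the `work_id` argument and the `parse_args` helper are hypothetical and do not describe the existing OCR script.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical OCR command line with a language hint option."""
    parser = argparse.ArgumentParser(description="Run OCR on a work (sketch).")
    parser.add_argument("work_id", help="work to OCR, e.g. bdr:W1PD95844")
    parser.add_argument(
        "--lang",
        default="bo",
        help="comma-separated language hints, e.g. 'bo' or 'bo,zh' (default: bo)",
    )
    args = parser.parse_args(argv)
    # Turn "bo,zh" into ["bo", "zh"], ignoring stray spaces and empty items.
    args.language_hints = [code.strip() for code in args.lang.split(",") if code.strip()]
    return args
```

Calling `parse_args(["bdr:W1PD95844", "--lang", "bo,zh"])` would give `language_hints == ["bo", "zh"]`, and omitting `--lang` falls back to `["bo"]`.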

ngawangtrinley commented 3 years ago

ok, quick feedback on both topics. 

  1. binarizing vs not binarizing
    1. I remember asking Zach and/or someone at Google about binarizing images, and they said that Google has its own pipeline to do these things and that we have a better chance of getting good quality if we leave that to them
    2. they also said that the rule of thumb for pre-processing was to make images look as good as possible to the human eye since that's what the models are calibrated on
    3. we did a lot of testing a couple of years ago and one of the things that stood out is that low quality images lose a lot with binarizing, especially those that are a bit blurry and/or have a low resolution
  2. forcing "bo"
    1. @10zinten made the relevant point that rubbish Tibetan is much more difficult to detect and clean up than rubbish strings in random scripts. That's especially true with Google OCR, which detects one syllable/chunk at a time and seems to use text generation too. For example, ཀ་།བད་ཁ་ལ་ཁ་བ་འབབ་ཁྲིད་ཀ་ཁ 7 ན། are all legal syllables; the only thing we can (inconclusively) rely on is the fact that it's a cluster of LFMs (low frequency monosyllabic words), but even that is not going to work for shorter rubbish strings.
    2. for beginning and ending pages we could try both Chinese and Tibetan and check the average confidence before deciding what to save. Maybe an average confidence test, followed by a forced OCR in the other language if it is under a threshold, could be done by default for the first 20 and last 20 pages of each pecha. I'm not sure how scalable something like this would be though.
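The confidence check in 2.2 could look roughly like the sketch below. Everything here is an assumption for illustration: the `ocr` callable (taking an image and a language code and returning the text plus a list of per-word confidences), the 0.8 threshold, and the choice to keep whichever run scores higher.

```python
from statistics import mean


def average_confidence(confidences):
    """Mean of the per-word confidences, or 0.0 when nothing was recognized."""
    return mean(confidences) if confidences else 0.0


def is_edge_page(index, total, edge=20):
    """True for the first `edge` and last `edge` pages (0-based index)."""
    return index < edge or index >= total - edge


def ocr_with_fallback(image, ocr, primary="bo", fallback="zh", threshold=0.8):
    """Run OCR with `primary`; if confidence is low, retry with `fallback`.

    Returns (text, average confidence, language used), keeping the run with
    the higher average confidence.
    """
    text, confidences = ocr(image, primary)
    best = (text, average_confidence(confidences), primary)
    if best[1] < threshold:
        text, confidences = ocr(image, fallback)
        score = average_confidence(confidences)
        if score > best[1]:
            best = (text, score, fallback)
    return best
```

Note that this doubles the number of OCR calls for every low-confidence edge page, which is the scalability concern raised above.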

I guess the ideal solution is to set up a simple interface similar to the one we made for Namsel, and ideally plug in a payment method.

eroux commented 3 years ago

Thanks! Yes, I agree forcing bo all the time is not a good option, but could we have it as an option when we run the ocr script?

10zinten commented 3 years ago

fix https://github.com/OpenPecha-dev/img2opf/commit/355192d64d782f060e2a8ce86655b239e3a64b0c

eroux commented 3 years ago

Thanks! It might be best to have the possibility to specify a list of languages instead of just one, but that can wait. We can use it already, thanks!