Open eroux opened 3 years ago
An example of failure: in that case the OCR added some Punjabi in the middle of the Tibetan:
W1PD95844-I1PD95910-394-394-any.docx W1PD95844-I1PD95910-394-394-any.pdf
the languageHints field of imageContext seems to be what we want to set to "bo"
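For reference, here is a minimal sketch of where that field sits in the JSON body of a Vision `images:annotate` REST request. The helper name `build_annotate_request` and the placeholder image bytes are only illustrative, and this is not necessarily how the OCR script builds its requests.

```python
import base64
import json


def build_annotate_request(image_bytes, language_hints=None):
    """Build the JSON body for a Vision `images:annotate` REST call.

    `build_annotate_request` is a hypothetical helper, not part of any library.
    """
    request = {
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
    }
    # Only add imageContext when hints were given, so that Vision falls back
    # to automatic language detection otherwise.
    if language_hints:
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}


# Placeholder bytes stand in for a real image file.
body = build_annotate_request(b"placeholder image bytes", ["bo"])
print(json.dumps(body, indent=2))
```

Leaving `language_hints` empty omits `imageContext` entirely, which keeps the current auto-detection behaviour.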
Sure. For this, can we assume that all BDRC works are in Tibetan, or is there a BDRC API endpoint that we can query to get the language of a particular work?
the OP_info query should have it (if not, you can default to Tibetan), see for instance http://purl.bdrc.io/query/graph/OP_info?R_RES=bdr:W1PD95844 :
in a first iteration, you can handle just bdr:LangBo, bdr:LangEn and bdr:LangZh (and not set the value if there are other languages). If there are multiple values for language (which happens from time to time), you can just not set the value
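The rule above could be sketched roughly as follows. This assumes the language values have already been extracted from the OP_info response into a plain list of strings like `"bdr:LangBo"`; the parsing of the response itself is not shown, and the function name and the `"bo"`/`"en"`/`"zh"` hint codes are illustrative choices.

```python
# Mapping from BDRC language resources to language hint codes.
# Only the three languages mentioned for the first iteration are handled.
SUPPORTED_LANGUAGES = {
    "bdr:LangBo": "bo",
    "bdr:LangEn": "en",
    "bdr:LangZh": "zh",
}


def language_hints_for(work_languages):
    """Return a list of language hints, or None to leave the value unset.

    `work_languages` is assumed to be the list of language values found
    for the work (hypothetical input shape).
    """
    unique = set(work_languages)
    if not unique:
        # No language information at all: default to Tibetan.
        return ["bo"]
    if len(unique) > 1:
        # Multiple languages: do not set the value.
        return None
    code = SUPPORTED_LANGUAGES.get(next(iter(unique)))
    # Any other single language: do not set the value either.
    return [code] if code else None


print(language_hints_for(["bdr:LangBo"]))
print(language_hints_for(["bdr:LangBo", "bdr:LangZh"]))
```

Returning `None` rather than an empty list makes it easy for the caller to skip the hint altogether.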
Great! I will push the fix as soon as possible.
for reference (and tests), here's a work with multiple languages: http://purl.bdrc.io/query/graph/OP_info?R_RES=bdr:W1AC406
Just out of curiosity, can you run the OCR on https://iiif.bdrc.io/bdr:I1PD95910::I1PD959100394.jpg/full/max/0/default.jpg with the bo option?
Just to fully document the issue: this seems to be a case where Google Vision is trying to read the other side of the paper:
so in that specific case, setting the language to Tibetan will not help significantly... @ngawangtrinley perhaps we can do a bit of preprocessing of the contrast on the image?
Also @10zinten we should save the settings used for Google Vision (the language setting but perhaps others too?) in the meta.xml of the pecha
I think this should perhaps be an option that can be set manually when doing the OCR (through a command line option or something similar). In some cases (like blockprints) we really want to use it, but in some modern prints we may want to specify "bo", "zh" when there's a Chinese introduction... but the default could be bo
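A minimal sketch of such a command line option, using `argparse`. The `--lang` flag name and its behaviour are assumptions for illustration, not the actual interface of the OCR script.

```python
import argparse


def parse_args(argv=None):
    """Parse OCR options; `--lang` is a hypothetical flag name."""
    parser = argparse.ArgumentParser(
        description="Run OCR with optional language hints"
    )
    parser.add_argument(
        "--lang",
        nargs="*",
        default=["bo"],
        metavar="CODE",
        help="language hints to send; pass --lang with no value to send none",
    )
    return parser.parse_args(argv)


# Default: Tibetan only.
print(parse_args([]).lang)
# Modern print with a Chinese introduction.
print(parse_args(["--lang", "bo", "zh"]).lang)
# No hints at all, leaving language detection to the OCR engine.
print(parse_args(["--lang"]).lang)
```

Accepting a list from the start avoids changing the interface later if several languages need to be specified.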
ok, quick feedback on both topics.
I guess the ideal solution is to set up a simple interface similar to the one we made for Namsel, and ideally plug in a payment method.
Thanks! Yes, I agree forcing bo all the time is not a good option, but could we have it as an option when we run the ocr script?
Thanks! It might be best to have the possibility to specify a list of languages instead of just one, but that can wait. We can use it already, thanks!
Here are some remarks from someone who uses OpenPecha on BUDA and also uses Google OCR directly:
Do we force the language to Tibetan when we run the OCR?