unnecessary spaces injected in Japanese text (OCR-related)

coltonoscopy commented 3 years ago

Hey, this is an awesome project! :D I've been using it a little bit after a few other random OCR solutions for my own Japanese studying, but all solutions seem to be plagued by OCR injecting random spaces into the text, including Google's own Vision API, unfortunately. For example:

謂れも知らず忌み続けてきた

might end up getting parsed by the Vision API as:

謂れ も知らず 忌み 続けて きた

And unfortunately, this results almost 100% of the time in a different translation (sometimes slightly, sometimes severely). The effects ripple into the HTML view as well when trying to parse compounds with rikaikun, altogether slowing down and muddying the experience significantly. However, this seems like something that could be easily fixed prior to the translation phase with something as simple as (JS as an example):

visionText.replace(/ /g, '')

As you know, Japanese never really has spaces in it anyway, so it's a relatively safe text replacement :) Something I could prob tackle myself, but if it's easy enough for you to throw into an update or something, this would be super neat! Thanks so much for your hard work!
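A slightly more conservative take on the same idea, as a rough sketch only: a blanket replace also glues together any Latin words on screen (button labels, stat names), so this version drops a space only when both of its neighbours are Japanese characters. `stripJapaneseSpaces` is a hypothetical name and not anything in UGT; it also covers the full-width space (U+3000), which `/ /g` would miss.

```javascript
// Hypothetical helper, not part of UGT: removes spaces only when both
// neighbours are Japanese (kanji, kana, or CJK punctuation).
const JP =
  "\\p{Script_Extensions=Han}\\p{Script_Extensions=Hiragana}\\p{Script_Extensions=Katakana}";

// Matches runs of ASCII or full-width spaces that sit between two Japanese characters.
const SPACE_BETWEEN_JP = new RegExp(`(?<=[${JP}])[ \\u3000]+(?=[${JP}])`, "gu");

function stripJapaneseSpaces(text) {
  return text.replace(SPACE_BETWEEN_JP, "");
}

console.log(stripJapaneseSpaces("謂れ も知らず 忌み 続けて きた"));
// 謂れも知らず忌み続けてきた
```

Spaces next to Latin letters or digits are left alone, so mixed strings such as "HP 回復 アイテム" keep the space after "HP".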

SethRobinson commented 3 years ago

Hey thanks for the comments.

Honestly I haven't noticed this problem in the testing I've done (primarily games) - can you give me a sample screenshot to use as a test case? I could see exactly where the issue is.

As you mentioned, it would be easy to add an option to filter out spaces, but this would only work if Google still sees them as being in the same sentence. Otherwise, I may have to tweak my own code that decides if two words are different things (for example, two buttons) by looking at the relative character spacing between them.
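For anyone following along, a spacing check of that kind could look roughly like the sketch below. This is not UGT's actual code: the box shape (`x`, `y`, `width`, `height` in pixels), the name `isSameRun`, and the `maxGapRatio` threshold are all assumptions made for illustration.

```javascript
// Hypothetical box shape: { text, x, y, width, height } in pixels.
// maxGapRatio is an assumed tuning knob, not a value taken from UGT.
function isSameRun(left, right, maxGapRatio = 1.0) {
  // Use the smaller box height as a stand-in for the character size.
  const charSize = Math.min(left.height, right.height);

  // Horizontal distance from the end of the left box to the start of the right one.
  const gap = right.x - (left.x + left.width);

  // How much the two boxes overlap vertically (negative if they do not).
  const verticalOverlap =
    Math.min(left.y + left.height, right.y + right.height) -
    Math.max(left.y, right.y);

  // Same run: roughly on the same line, and the gap is small relative to the text size.
  return verticalOverlap > charSize / 2 && gap <= maxGapRatio * charSize;
}
```

Two words a few pixels apart on the same line would be merged, while two buttons separated by several character widths would stay separate.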

Seth A. Robinson www.rtsoft.com


coltonoscopy commented 3 years ago

Hey, sure thing (and thanks for the response)! Here's just the very first sample I fetched from the screen I was on in the game I was playing (Final Fantasy XIV Online) when I read the GitHub message:

[Screenshot: background]

The text should be:

舞台上で最も哀れな役者

but is instead

舞台 上 で 最も 哀れ な 役者

which, in a small example, doesn't subtract too much meaning (none at all in this case), but in a larger textbox with spacing between the wrong characters in a compound especially, this can confuse rikaikun and also present different translations through Google Translate :) Even if the meaning is still largely intact, I've noticed sometimes it's not quite on the mark with spaces. If you're doing actual work with the vision algorithm beyond feeding it into Google Vision and fetching the result though, I may have proposed an over-simplified solution, so apologies if so :)

Hope this is helpful!

SethRobinson commented 3 years ago

Thanks very much - I was surprised to find spaces being added everywhere to Asian languages, but I just didn't notice due to the normal UGT display font having very thin spaces! I've released 0.68 with the fix.

The problem was I was manually adding spaces between whatever Google reported was a "word", obviously a bad idea with Asian languages. There may be a tiny use case with hiragana/katakana-only games (say, on the Famicom) that do use spacing for legibility, but the translation is so awful in cases like that I'm not worried about it for now anyway.
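To illustrate the idea (a sketch only, not the code that shipped in 0.68): when gluing OCR "words" back together, a space is added only if neither side of the boundary is a CJK character. `joinWords` and the plain string-array input are assumptions made for this example.

```javascript
// Hypothetical helper, not UGT's implementation.
const CJK =
  "\\p{Script_Extensions=Han}\\p{Script_Extensions=Hiragana}\\p{Script_Extensions=Katakana}";
const CJK_AT_END = new RegExp(`[${CJK}]$`, "u");
const CJK_AT_START = new RegExp(`^[${CJK}]`, "u");

function joinWords(words) {
  let out = "";
  for (const word of words) {
    if (word === "") continue; // skip empty tokens
    // Insert a space only between two non-CJK neighbours (e.g. Latin words).
    const needsSpace =
      out !== "" && !CJK_AT_END.test(out) && !CJK_AT_START.test(word);
    out += (needsSpace ? " " : "") + word;
  }
  return out;
}

console.log(joinWords(["舞台", "上", "で", "最も", "哀れ", "な", "役者"]));
// 舞台上で最も哀れな役者
console.log(joinWords(["Final", "Fantasy", "XIV"]));
// Final Fantasy XIV
```

Kana-only text that relies on spacing for legibility would lose its spaces under this rule, which matches the trade-off described above.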

(closing, but can re-open if the fix didn't work right or something)

coltonoscopy commented 3 years ago

Awesome, thanks so much Seth! Can't wait to bust it out; going to be a great help with rikaikun especially! :D

SethRobinson / UGT

unnecessary spaces injected in Japanese text (OCR-related) #17