OCR for Japanese text (especially vertical text) degraded since 1.9.5

yaywalter commented 9 months ago

OCR for Japanese text, including vertical text, initially worked when it was first introduced, but is degraded in more recent versions. Testing by opening the same selection of manga volumes in various versions one at a time and going through the pages spamming CTRL+A to see what text gets selected, it seems the issue was introduced in version 1.9.5 (spotty detection of horizontal and vertical Japanese text) and got worse in 1.9.6 (better about detecting horizontal Japanese text than 1.9.5 but basically no longer recognizes vertical Japanese text at all). Versions 1.9.7 and 1.9.8 seem to perform identically to 1.9.6.

EDIT: Never mind. The "detected" text in 1.9.4 and earlier is wildly incorrect when you actually copy/paste it, so technically 1.9.5 and newer represent an improvement, since the copied text is actually correct. So I suppose this is just a general request to look into improving the OCR implementation for Japanese text. That seems like it should be possible: my understanding is that this app is piggybacking off of Apple's OCR tech, yet extracting the pages and opening them individually in Preview (or even just previewing the CBZ in Finder with SimpleComic's Quick Look extension) yields much better OCR results for Japanese text, including vertical text.

DavidPhillipOster commented 9 months ago

Simple Comic’s OCR support is implemented in OCRVision.m and is mostly a simple call to Apple’s OCR, as implemented by Apple’s Vision.framework, but there is one tunable parameter that is not exposed in the user interface. As Apple revises macOS, Simple Comic will use whatever version of Vision.framework ships with the installed macOS.

Currently, Simple Comic’s Settings dialog box has a single item for OCR, a checkbox "[] Recognize Text". This works by writing to the preference file com.ToWatchList.SimpleComic.plist in ~/Library/Containers/com.ToWatchList.SimpleComic/Data/Library/Preferences/ with the boolean-valued key OCRDisableKey. The absence of the key in that file means that OCR is enabled.

There is one more: the string-valued key OCRLanguageKey. The absence of the key means en-US. If you set it to ja-JP, then the next time you run Simple Comic you'll get improved response for Japanese text, at the cost of worse performance in other languages. The source code takes the language previously read from the plist and passes it to the OCR engine.

defaults write ~/Library/Containers/com.ToWatchList.SimpleComic/Data/Library/Preferences/com.ToWatchList.SimpleComic.plist OCRLanguageKey ja-JP

should set the key, and

defaults delete ~/Library/Containers/com.ToWatchList.SimpleComic/Data/Library/Preferences/com.ToWatchList.SimpleComic.plist OCRLanguageKey

should revert it back to its default state. As you might well imagine, since this was never brought out to a user interface in Simple Comic’s Settings dialog box, this mechanism is not well tested.
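
To confirm whether the write actually took effect, the key can be read back. The following is only a sketch: it assumes the same plist path and key name as the commands above, neither of which has been checked against the source, and the fallback message simply restates the en-US default described earlier.

```shell
#!/bin/sh
# Path and key name are copied from the commands above (assumed, not verified).
PLIST="$HOME/Library/Containers/com.ToWatchList.SimpleComic/Data/Library/Preferences/com.ToWatchList.SimpleComic.plist"
KEY="OCRLanguageKey"

if ! command -v defaults >/dev/null 2>&1; then
    # The defaults tool only exists on macOS.
    echo "defaults not found (this check only makes sense on macOS)"
elif value=$(defaults read "$PLIST" "$KEY" 2>/dev/null); then
    # defaults read exits successfully only when the key exists.
    echo "$KEY is set to: $value"
else
    echo "$KEY is not set; Simple Comic should fall back to en-US"
fi
```

Quit Simple Comic before changing the key and relaunch it afterwards, since the value is only picked up the next time the app runs.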

DavidPhillipOster commented 9 months ago

On my Mac, running Sonoma (macOS 14.2.1),

VNRecognizeTextRequest *textRequest = [[VNRecognizeTextRequest alloc] initWithCompletionHandler:^(VNRequest *request, NSError *error){}];
sOCRLanguages = [textRequest supportedRecognitionLanguagesAndReturnError:nil];

sets sOCRLanguages to:

[
en-US,
fr-FR,
it-IT,
de-DE,
es-ES,
pt-BR,
zh-Hans,
zh-Hant,
yue-Hans,
yue-Hant,
ko-KR,
ja-JP,
ru-RU,
uk-UA,
th-TH,
vi-VT
]

I'd intended to translate those codes to localized names and put them in a popup menu in Simple Comic's Settings dialog box, along with an additional "Detect Language" choice, with the currently selected item saved in preferences under OCRLanguageKey. That's why the code to initialize the data structures is in + (void)initialize of OCRVision: so the data would be available early in case the user wanted to change it before opening any documents.

This feature enhancement would be pretty easy to complete, but I don't want to make the Settings dialog too geeky.

nickv2002 commented 9 months ago

I'm curious what you think @yaywalter after trying the defaults write change suggested by @DavidPhillipOster - does it improve OCR to match what you see in Preview.app?

If it does indeed make a difference, how is Preview.app detecting the language it should use? Or is your system running in Japanese?

yaywalter commented 9 months ago

Unfortunately setting OCRLanguageKey to ja-JP doesn't seem to improve the OCR performance of Japanese text for me: https://www.youtube.com/watch?v=MLTL1DhPfMY

My system language and region are English/United States, but I do have my system configured to allow for Japanese input.

DavidPhillipOster commented 9 months ago

Try this: Using Simple Comic, navigate to a page of a book with vertical Japanese text. Use File > Capture Page to select the page as graphics, then click on the page to get a Save As dialog. That will save that one page as a simple image file. (I saved mine on my desktop.) Now, open that image file in Preview. Try to select the OCRed text there.

Preview’s rules for tracking the mouse during text selection are a bit different from Simple Comic's, since Simple Comic is tuned more to speech bubbles and Preview more to a single column of text filling the whole page. Copy-Paste the selected text into a text document.

Does Preview work better than Simple Comic?

Does adding Japanese to System Settings > General > Language & Region > Preferred Languages > + > 日本語 - Japanese (and restarting Preview, or Simple Comic) help at all?
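
The Preferred Languages list can also be checked from Terminal. This is a rough sketch that relies on the AppleLanguages key in the global defaults domain and uses a loose substring match for "ja", so treat the result as a hint rather than a definitive answer.

```shell
#!/bin/sh
# AppleLanguages in the global domain holds the ordered Preferred Languages list.
if command -v defaults >/dev/null 2>&1; then
    langs=$(defaults read -g AppleLanguages 2>/dev/null)
    case "$langs" in
        # Loose substring match; entries usually look like "ja-US" or "ja-JP".
        *ja*) status="Japanese appears in the preferred languages list" ;;
        *)    status="Japanese does not appear in the preferred languages list" ;;
    esac
else
    # The defaults tool only exists on macOS.
    status="defaults not found (this check only makes sense on macOS)"
fi
echo "$status"
```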

yaywalter commented 9 months ago

Testing that, yes capturing the page and opening the exported image in Preview yields better OCR results.

I already have Japanese in my preferred languages, but I tested making it my system language temporarily to see if it would have an effect. It did not.

DavidPhillipOster commented 9 months ago

Thank you. After your reply I reviewed the documentation and header files for Vision.framework to see if there is anything added recently that I previously missed that would improve Japanese recognition. I didn't find any missed opportunities.

MaddTheSane / Simple-Comic

OCR for Japanese text (especially vertical text) degraded since 1.9.5 #111