VoxelCubes / PanelCleaner

An AI-powered tool to clean manga panels.
GNU General Public License v3.0
208 stars, 16 forks

Spanish OCR support #30

Keiser04 closed 1 week ago

Keiser04 commented 8 months ago

so I can help

VoxelCubes commented 8 months ago

Hey, the model used here for OCR is https://huggingface.co/kha-white/manga-ocr-base, but it's easily swappable. Thing is, for Spanish text, which is probably not gonna be vertical, I hope, Tesseract (the free and open OCR tool that Google uses) should work fine. I'd just need to integrate it.

Alternatively, should Tesseract have trouble with manga fonts for some reason, you can train your own model and make a Python package that works just like MangaOCR. I have no experience training or writing models, though.
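As a rough sketch of what "works just like MangaOCR" could mean: manga-ocr gives you an object that you call with an image and that returns the recognized text as a string. The class below mirrors that call shape for image paths only. `CallableOcr` and its `backend` parameter are hypothetical names for illustration, not part of PanelCleaner or manga-ocr, and the default backend assumes the `pytesseract` and Pillow packages plus a Tesseract install with Spanish data (`spa`).

```python
from pathlib import Path
from typing import Callable, Union

ImageSource = Union[str, Path]


def _tesseract_backend(image_path: str, lang: str) -> str:
    # Imported lazily so the class still works with another backend
    # on machines where pytesseract or Pillow is not installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


class CallableOcr:
    """Hypothetical drop-in: call it with an image path, get text back."""

    def __init__(
        self,
        lang: str = "spa",
        backend: Callable[[str, str], str] = _tesseract_backend,
    ) -> None:
        self.lang = lang
        self.backend = backend

    def __call__(self, image: ImageSource) -> str:
        text = self.backend(str(image), self.lang)
        # Collapse the line breaks inside a bubble into single spaces.
        return " ".join(text.split())
```

A custom-trained model would only need to be wrapped in a function with the same `(image_path, lang)` signature and passed in as `backend`.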

The bigger problem is comic text detector, which is used to find where text is and what language it's supposed to be. I tested it briefly and unfortunately it's incapable of detecting Spanish text reliably and, more importantly, of classifying it as Spanish (or at least as non-Japanese). It thinks it's either English or Japanese, which won't work here, because for Japanese text we'd need to fall back to the current model, which handles Japanese much better than Tesseract. Now, we could just manually override that and force the use of Tesseract, but the other problem is that comic text detector didn't pick up on all bubbles. It ignored some of them, which is a bigger problem, because there is no manual override that could fix a missing bubble.
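To make the override idea concrete, here is a minimal sketch of per-bubble routing. The function name, the engine labels, and the language tags (`"ja"`, `"en"`, `"es"`) are all assumptions for illustration; they are not the labels comic text detector actually emits, nor PanelCleaner's code.

```python
from typing import Optional, Tuple

JAPANESE = "ja"  # hypothetical tag, not the detector's real label


def pick_engine(detected_lang: str, forced_lang: Optional[str] = None) -> Tuple[str, str]:
    """Return (engine, language) for one detected bubble."""
    # A forced language wins over whatever the detector guessed.
    lang = forced_lang or detected_lang
    if lang == JAPANESE:
        # Japanese stays on the current manga-ocr model.
        return ("manga-ocr", lang)
    # Everything else goes to Tesseract.
    return ("tesseract", lang)
```

Note that this only runs on bubbles the detector found in the first place, so it does nothing for the missed ones.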

So what you could do is help out the guys at comic text detector to get support for Spanish working reliably; then we could have a shot at this here.

VoxelCubes commented 1 week ago

In the upcoming version 2.9.0 you'll be able to enable Tesseract and force it to tag all bubbles with Spanish, so that it will use Tesseract's Spanish module (as long as the datapack for that is installed, details in the README). That will let you do okayish Spanish OCR. Tesseract just isn't good with comic fonts.
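If you want to check whether the Spanish data is present before trying this, a small sketch along these lines should do, assuming the `tesseract` binary is on your PATH. It relies on `tesseract --list-langs`, where Spanish shows up as `spa`. The helper names are made up for this example, and the README remains the place for the actual install steps.

```python
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Parse the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages ..." header.
    return [ln for ln in lines if ln and not ln.lower().startswith("list of")]


def installed_languages() -> List[str]:
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


def has_spanish() -> bool:
    return "spa" in installed_languages()
```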