Integrate Tesseract OCR and Improve Sorting for Comic Processing.

VoxelCubes / PanelCleaner

An AI-powered tool to clean manga panels.

GNU General Public License v3.0

202 stars 16 forks source link

Integrate Tesseract OCR and Improve Sorting for Comic Processing. #85

Closed civvic closed 3 months ago

civvic commented 4 months ago

This PR addresses the enhancements discussed in #59.

This PR introduces several enhancements to the project, focusing on improving OCR capabilities and trying to refine the sorting logic for comics to better accommodate different reading orders. Below are the key changes and their impact:

Key Changes:

Tesseract OCR Integration: Implemented Tesseract OCR as an alternative OCR engine to support more languages, specifically English and Japanese. This integration allows users to choose between the built-in OCR model and Tesseract based on their needs and the languages of their comic strips.
Configuration Updates: Added new configuration options (ocr_use_tesseract, ocr_tesseract_lang, reading_order) to the default.conf file, enabling easy switching between OCR models and setting preferred languages and reading orders.
Sorting Logic Enhancement: Refined the sorting algorithm for text boxes to align with the expected reading orders for both occidental (left-to-right, top-to-bottom) and manga (right-to-left, top-to-bottom) styles. This includes adjustments to handle slight misalignments and ensure a more natural reading flow.

Motivation:

The motivation behind these changes is to enhance the flexibility of the PanelCleaner project in processing comics. By integrating Tesseract OCR, the project can now support a wider range of languages, making it more versatile and useful for a global audience. The improvements to the sorting logic ensure that the processed comic strips reflect the intended reading order, enhancing the user experience.

Testing:

These changes have been tested with a variety of comic pages in English to ensure that the OCR recognition is accurate, and that the sorting logic correctly aligns with the specified reading orders.

Caveats:

I aimed to integrate Tesseract OCR with minimal disruption to the existing codebase, focusing on enhancing functionality while preserving the current project structure and logic. My familiarity with the codebase is still growing, and I've prioritized making changes that are straightforward and easily reversible, to ensure that the core functionality remains stable and reliable. I recognize that there may be more efficient or elegant ways to achieve this integration, and I'm very much open to feedback and suggestions. My goal was to lay a foundation for Tesseract OCR (and any other OCR engine) support that could be built upon and refined. I believe that with your deeper understanding of the project and its architecture, we can further improve this feature together.

I'm looking forward to your insights and am ready to collaborate on making any necessary adjustments. Thank you for considering my contribution, and I hope it serves as a valuable stepping stone towards more versatile OCR capabilities within the project.

VoxelCubes commented 4 months ago

This is incredible, thanks for putting in the effort! My focus has shifted to other projects, so I likely wouldn't have gotten around to this in quite a while, so thank you very much!

It looks pretty good, apart from needing to place what you added in the default.conf into the actual string templates of the config.py file. The default.conf is just part of the documentation, it alone can't influence anything.

Understandably, you didn't want to touch the gui side just yet, especially if I ended up rejecting the pr outright, which you don't need to worry about.

Fortunately, the entire profile UI is loaded dynamically from the string templates in config.py, so it's just a matter of integrating the stuff you did in main.py into the gui/processing.py file. Probably best if I look into that on the weekend, but feel free to get a head start. An additional nicety I'll probably add for the gui is a new type for the reading order, something like str | RedingOrder. This would let the gui detect that and insert a combobox with the options pre-filled, rather than letting the user enter whatever string. More of a QoL feature though, that can wait.

There is one complication, however, and that has to do with tesseract model data. How does that work, are they all bundled into the python package? If not, it would be best to pre-load them in the setup greeter with the other models.

Also, the mangaocr model takes forever to initialize, so it gets set up as a sort of singleton when the program launches. It would be possible to have a second variable holding the tesseract model(s), then additionally loading whichever on demand when switching the profile. That shouldn't be too bad, likely something I'll have to look into though.

For now, I'll test this a bit, but so far so good!

PS. thanks for that little change for MacOS. I don't have a good way to test it there myself.

VoxelCubes commented 4 months ago

See the description of 489064e5426a777c61319f3d84a856c68a2c293b for some details. Hang on, I just realized that you hadn't let the OCR subcommand use the CLI args for --profile because that isn't parsed for the OCR subcommand. Well, in light of the expanded settings for OCR, it's about time that's added to the CLI options.

VoxelCubes commented 4 months ago

All right, that does work pretty well! I ran into a little "tesseract couldn't find the requested language pack" error, which was easy to fix through my package manager, thankfully. That could be a little trickier for other people though. At the very least, the error should be caught more elegantly and explained to the user. To prevent CLI vs GUI conflicts, I'd recommend catching the exception in the main.py and directing the user to the appropriate tesseract documentation, or maybe find a way to load it automatically.

Anyway, I'm out of time now, I'll get back to you at a later date.

civvic commented 4 months ago

Thanks for all the comments! Late night here, I’ll check all tomorrow. This was a quick proof of concept. If you consider it has merits, I’ll try to dive deeply into the code base, including the GUI.

VoxelCubes commented 4 months ago

Yup, having this would definitely be helpful. But be aware that this strategy is fundamentally limited by how well the text detector works, otherwise tesseract is helpless. It can only reliably detect JP and EN. If it helps, here is an outline of what I would do next:

When catching these tesseract missing the language data exceptions from within main.py, you can direct users to this page here https://ocrmypdf.readthedocs.io/en/latest/languages.html it isn't an official page, but it's good documentation.
Add a similar explanation to the config.py string templates, using the [CLI:...] and [GUI:...] syntax as seen for model_path to make the link clickable in the gui.
Repurpose shared_ocr_model: gst.Shared[gst.OCRModel] in main.py:57 to hold one of manga_ocr or the tesserract model. To ensure they don't need to be recreated every time a user switches though, these should still be saved in the mainwindow driver as member variables. For the tesseract, it would probably require saving what language it was loaded with too, because it would need to be reloaded when that changes as well.
Then the maindwindow_driver's load_ocr_model method would need too receive information on what ocr model to produce, depending on profile settings.
Hook up the load_ocr_model function to the signal for profile changes, so that it can check if it needs to change what the shared ocr model points to, loading a new tesseract instance if needed. The required signal here is profile_values_changed, hooking it up should go right under the line with self.profile_values_changed.connect(self.file_table.update_all_rows)
Think about how that could cause race conditions with the current cleaning process, if one is running. So instead of hooking it up directly to the signal, wrap it inside a worker thread. Best idea would probably be to take the existing start_initialization_worker method and split it up into separate ones for clearing the cache and loading the OCR model. This way the loading process ends up in the single-threaded work queue and cannot happen during a cleaning operation.

That should be it for the functional aspects. Next is some quality of life stuff:

back in config.py, create a StrEnum called ReadingOrder which holds the two valid directions and change the type of preprocessor.reading_order from str to str | ReadingOrder, adding a special handler in try_to_load that will check if the text matches one of the enum's values, otherwise printing an error and returning, which will result in the config setting the default value.
in profile_parser this new type would need to be added to the EntryTypes enum, as well as having new cases for this in parse_profile_structure and create_data_widget. That can be based on how I did it for MimeSuffixIMG.
Then the same process, but for the ocr_tesseract_lang setting. But no enum here, as we can't be too sure of the supported languages.
In create_data_widget for this one, use a utility function that calls pytesseract.get_languages(config='') to figure out what languages are supported, and have a hard coded lookup dict for the language code to language name, as defined here https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages But watch out for the language code "osd" as that produces metadata about what alphabet was used. not too helpful, probably, likely best to exclude it.
Using that data, creating a CComboBox (CustomComboBox) using the language code as the linked data, so you can display the friendly name in the foreground, same as with the MimeSuffixIMG. But maybe doing something like "Engling (eng)" could be clearer? Up to you.
Finally, hide the helper text about language codes in the config.py string template from the GUI, that means wrapping those parts in [CLI:...] tags, since in the GUI you'd now have implemented a friendly combobox with actual language names.

For these kinds of changes, you won't need the makefile to regenerate assets. And since I include the compiled assets in the repo (which isn't a good practice, mind you, I just do that so I don't need to figure out a way to generate them when booting up a windows VM for compiling the exe, as Windows can't use a makefile) that should make it easy for you to run the gui from source.

Hope this helps! I've contributed a to a few open source projects myself and always found it hard to figure out where to make the changes, not so much what to do once I figured that part out. So I hope this guiding line will help a lot to actually have a smooth experience, rather than trying to figure out what corner of the code needs changes for some specific thing to work. Good luck!

civvic commented 4 months ago

Thank you for the detailed feedback and guidance. I realize now that I overlooked documenting the Tesseract installation process, as I've been using conda (mamba, specifically) for managing my environments. Additionally, it's important to improve error handling for Tesseract, especially regarding missing language packs (but see below, English is installed by default, and maybe the japanese pack would not be needed). I'll address these issues by updating the requirements, config, and README to include clear instructions for Tesseract installation and enhancing error handling to guide users more effectively. I'll get started on this right away.

It's been quite some time since I last worked with Python executables and engaged with cross-platform GUI frameworks (Wx being the last one I used). Nowadays, my coding is primarily for personal projects, utilizing Jupyter notebooks and the nbdev library to integrate exploration, coding, documentation, and testing seamlessly. However, with your detailed outline, I don't foresee any major issues.

I plan to start with the Mac version, as that's more within my comfort zone, and then move on to Linux. My Linux setup is a barebones Ubuntu, primarily used for model training and remote development, without even a window manager installed. It can dual-boot into Windows, but it's been months since I last used Windows, and with Windows, one can never be too sure what to expect. Once I've made progress on the Mac version, I'll reach out for further advice.

In choosing Tesseract for OCR integration, I was guided by its superior handling of handwritten text and the unique quirks found in comics, compared to alternatives like EasyOCR. I fully recognize that the project's effectiveness is significantly influenced by the Comic Text Detector, which, by the way, performs amazingly well.

Maybe the optimal strategy might be to recommend Tesseract for English texts and manga-ocr for manga, given their respective strengths. Despite my very limited knowledge of manga or Japanese, it's evident from visual comparisons that manga-ocr outperforms Tesseract in handling Japanese texts. I did experiment with installing the Tesseract Japanese language pack as a learning exercise, but I wouldn't recommend Tesseract for processing Japanese. For English, however, Tesseract proves to be quite effective.

As it stands, PanelCleaner is performing exceptionally well. Until the text detection capabilities are expanded to encompass additional languages, I believe that utilizing manga-ocr for Japanese and Tesseract for English—and not extending beyond these without careful consideration—is the best approach. Naturally, there's room for further integration of other OCR solutions for English, Japanese, or both, as well as potential enhancements to text detection. However, it's important to acknowledge that improving text detection is no small feat.

VoxelCubes commented 4 months ago

Indeed, well said. It may be worth considering to automatically switch the OCR engine depending on the detected language, though it isn't entirely perfect in guessing what language something is, so this manual setting could prevent some weirdness, but perhaps offering an auto-mode could work out, since even Japanese manga pages often have Latin words for copyright and stuff printed in some far corner. Although, those are often identified as "unknown" language. It's a bit messy, as machine learning always is.

Oh, before I forget, pytesseract needs to be added to the requirements.txt (for development) and setup.cfg (for the whl package) file, but that's easy.

Something a bit trickier to do, but ultimately no big deal, would be to replace the current way csv files are created (which is raw string manipulation, oof). That code is duplicated across main.py and gui/processing.py. It was contributed by the previous pull request some time ago now, and I didn't bother replacing it with a proper csv library because manga_ocr only outputs full-width unicode characters. It's unable to output any ascii characters that would potentially mess up the csv, but that's no longer going to be the case with English and tesseract, so using the builtin csv module should take care of that instead. It was a nice shortcut while it lasted though, haha.

Fortunately, you won't need to mess much with the Qt side of things; there are plenty of pitfalls there but you should be spared of them for this stuff concerning ocr. An interesting thing I really like about making a python application with Qt is that the Qt runtime is separate, so after launching the app, the control flow is entirely handled by the Qt runtime. This means that python code only starts running again when I click something in the gui. And the neat thing is now, that if that code runs into an exception, it will crash the python thread, but won't bring down the app. So I can just keep using the gui and click more things, starting another python thread and it won't be affected by the previous one having an exception. That makes it incredibly hard to actually crash the app, so the exception handler that shows the error log to the user will typically always appear, letting you know what went wrong.

All that said, it's nicer to actually catch the tesseract exceptions so you can inform the user more directly about the issue and how to fix it. In the CLI with print statements, but for the GUI with the show_warning from gui/gui_utils.py. Formatting the text as html will then make any links clickable.

It can dual-boot into Windows, but it's been months since I last used Windows, and with Windows, one can never be too sure what to expect.

Haha, same for me. It's gotten to the point where I prefer just having a virtual machine, so I'm spared from the inconvenience of rebooting whenever I just need to build the windows exe. Speaking of the exe, I wonder how tesseract will work with that. Sounds like the user will need to install that himself then, in which case your documentation will be very handy. So thank you for that.

And lastly, testing... Yeah, I haven't done much of that in this project, partially because writing tests for GUIs is significantly more painful. At least with DeepQt that I'm working on again now, I'm putting in some effort to test more things, and it's proven quite useful for catching bugs early.

civvic commented 4 months ago

Hi, I've made quite a few changes. Ultimately, I decided to streamline the configuration options to just two: ocr_engine and reading_order. Both can be set to auto, allowing PanelCleaner to automatically detect the page language and decide which OCR engine to use (manga-ocr for Japanese and unknown languages, Tesseract for English).

I've created a new module, ocr, to centralize all OCR-related functionalities. Now, each engine has its own adapter. This should somewhat ease future integrations. Don't know if this affects bundling.

A significant change is that we're no longer passing around the OCR model itself but rather a mapping of all the configured OCR engines (see ocr.py).

Regarding performance and caching, pytesseract is a simple wrapper around the Tesseract command line, so there's nothing to cache. I've experimented a bit with tesserocr, a direct CPython integration with the Tesseract library, which is much faster and multithreaded and would be easier to bundle than pytesseract, I think. It works quite well, but I haven't had time to fully explore the implications of switching from pytesseract to tesserocr.

The GUI and everything else works on Mac, but I haven't tested it on other platforms yet. Given the changes, I don't anticipate any issues on Linux/Windows. The only potential problem could arise from the Tesseract installation.

For future improvements, my suggestion would be to perform OCR not on the original image but on the extracted text and perhaps relax the mask generation criteria a bit when OCRing. Tesseract, in particular, struggles with balloon frames and other image artifacts. As I know next to nothing of Japanese or manga or have readily testing images, I can't judge the quality with manga pages.

Looks like I've given you quite a bit to review – sorry about the homework! 😅 Really appreciate your time and can't wait to hear what you think. Thanks a bunch!

VoxelCubes commented 4 months ago

Hey, this looks pretty solid, I like the approach! I used the black formatter (via the make action) to tidy things up and fixed a few tiny things with the config strings. By the way, the way I create the default.conf is by just saving the default profile as something and then copying the file's contents over. The only manual step there is to preserve the description at the very top. The actual default profile has none.

In the last commit I also added 2 TODOs, regarding informing the user about tesseract missing in the gui somehow, and the other about the problematic double loading of the MangaOCR model, which takes several seconds. Normally, this is dealt with by loading it in a worker process at startup, which puts it on a secondary thread so it doesn't freeze the GUI. Now however you also load it as part of initialize_profiles, see this stack trace:

init, ocr_mangaocr.py:11 get_ocr_processor, ocr.py:39 load_current_profile, mainwindow_driver.py:1012 initialize_profiles, mainwindow_driver.py:889 initialize_ui, mainwindow_driver.py:206 init, mainwindow_driver.py:120 launch, launcher.py:141 main, main.py:277

, main.py:722 Which does delay startup by several seconds. Another issue is the way multiple MangaOCR instances are created in get_ocr_processor. This wastes a lot of memory and time again due to MangaOCR being so fat. I think there is an easy solution here though, one that doesn't involve worker threads even, unless you'd prefer to stick with them. If you make the MangaOCR a singleton that is not initialized until it is called, which only happens when processing something, which only happens from a worker thread anyway, it should solve the issue. This will of course increase the time the first image needs to process, but as a singleton it should only affect the first image, so not a terrible compromise. In that case the OCR initialization worker load_ocr_model could be used to call it once to remediate the issue again. I'll send you some sample pages to test with. From my short testing I noticed that, rather disappointingly, comic text detector failed to assign the correct language half the time, resulting in the entire page being miscategorized. Not sure you can do much to fix that, other than re-introducing the override. Since I don't have Japanese tesseract lang packs installed, it silently reverted to MangaOCR. Not the best UX, but it actually ties into the following too: Another thing that surprised me is that you put in this effort to allow auto language detection, but then just do it on a page-by-page basis. Japanese Manga sometimes have little texts (typically logos or disclaimers) in English, so having it switch on a box-by-box basis would be useful here. Speaking of boxes, I see what you mean with tesseract not liking the bubble frames in its input. MangaOCR naturally never did mind that much, but tesseract is rather picky. It would be possible to place all the masking machinery into the OCR process, then use the isolated text output for OCR. As long as the cleaning threshold is overridden to be infinite (so no bubbles are ignored, though if it couldn't be cleaned well, odds are it won't be OCRed well) it should catch everything and give tesseract a cleaner base to pull from. Lots to consider, but I do like where this is going. Thank you so much!

civvic commented 4 months ago

Another thing that surprised me is that you put in this effort to allow auto language detection, but then just do it on a page-by-page basis. Japanese Manga sometimes have little texts (typically logos or disclaimers) in English, so having it switch on a box-by-box basis would be useful here.

Regarding the auto language detection being applied on a page-by-page basis rather than box-by-box, I appreciate your insight. Current approach was actually my third iteration. Initially, I simply aimed to complete the first PR without altering the configuration options significantly., but then though that was unnecessarily complicated for the user. However, as I delved deeper, considering that the text detector classifies each text blob individually, I explored the possibility of OCR'ing each balloon/box separately.

This led to the consideration of extending the Box class specifically for the OCR step, without entangling it with the rest of PanelCleaner's functionalities. Box is a cornerstone structure within PanelCleaner, and modifying it required altering several function signatures (and I had already changed several function sigantures):

class TaggedBox(Box):
    lang: DetectedLang = DetectedLang.UNKNOWN

However, I soon realized that Box is a frozen dataclass (attr), which underscored its significance and the intention to keep it immutable. This realization, along with the complexity of a Box's lifecycle, prompted me to reconsider. Although using a dictionary to tag boxes with languages seemed like a viable alternative, the frequent cloning, copying, and extending of boxes throughout the codebase posed a challenge in maintaining a synchronized mapping.

Consequently, I opted for a compromise: settling on a single language detection for the entire page. This decision was made with the understanding that it might not be the ideal solution, especially for content with multilingual text elements. The straightforward fix would involve modifying the Box class to include language information directly, but such a change would have broader implications that I believe are best evaluated by you, given your comprehensive understanding of the project's architecture. Alternatively, implementing a Box protocol for function and variable signatures, with two distinct frozen dataclasses (Box and TaggedBox), could offer a solution with minimal impact on the OCR stuff. This decision, particularly in the context of comics and manga where multilingual text may vary in prevalence, is something I'd like to defer to your expertise.

civvic commented 4 months ago

Apologies for the formatting issues. Lately, I've been coding primarily for personal projects and haven't been using code formatters. I'll make sure to configure Black in VSCode for any future contributions. If there's a specific configuration you prefer for the project, please do share it with me.

Regarding manga-ocr initialization, I'll investigate the matter further, but it seems that implementing MangaOcr as a singleton (or sharing one manga-ocr instance) with deferred initialization might be the most straightforward solution. I appreciate the suggestion.

Additionally, I wanted to share a bit about what brought me to this project. My interest lies in panel segmentation solutions, where balloon detection/cleaning, as facilitated by PanelCleaner, could be considered an initial step. My ultimate goal is to construct datasets for training or fine-tune models aimed at comics restoration. This encompasses tasks such as inverse halftoning, decolorization, line extraction, alignment, restoring damaged images from poor printings or conservation efforts, etc . Finding datasets with good annotations for these tasks has been challenging. While there are some small ones like eBDtheque or DCM, and Manga109—which, despite several attempts to contact the authors, seems to be an abandoned project or my credential aren't kosher—I haven't found comprehensive resources. My work often involves freelance digital restoration for publishers of classic comics in Spain, but this venture is a personal passion project, born from my love for the medium.

I've delved into numerous papers and repositories, and as Jeremy Howard often emphasizes, I find that Python or coding in general offers a more intuitive understanding of deep learning concepts than mathematical formulas alone. The literature and (working) repositories on this specific problem is scarce, especially outside of manga, making every bit of learning valuable. Working with PanelCleaner and navigating the challenges of pretrained models has been enlightening. The time spent here has been fruitful, and I'm grateful for the opportunity to contribute and learn in this space.

Lastly, considering the substantial nature of the ongoing changes and discussions, I suggest moving this conversation to a discussion thread to keep the PR commit history streamlined and focused. The Tesseract adapter PR seems like a significant shift that might benefit from a more dedicated space for dialogue. What are your thoughts?

civvic commented 4 months ago

Apologies for the formatting issues. Lately, I've been coding primarily for personal projects and haven't been using code formatters. I'll make sure to configure Black in VSCode for any future contributions. If there's a specific configuration you prefer for the project, please do share it with me.

Never mind, I see pyproject.toml

VoxelCubes commented 4 months ago

Never mind, I see pyproject.toml

Either by configuring black manually, or by using the makefile action make black-format, whichever you prefer. Fortunately I didn't mess up by having different line length values in the makefile vs pyproject.toml. I think the default line length of 88 is too short, hopefully 100 is comfortable, though also not very wide.

VoxelCubes commented 4 months ago

Lastly, considering the substantial nature of the ongoing changes and discussions, I suggest moving this conversation to a discussion thread to keep the PR commit history streamlined and focused. The Tesseract adapter PR seems like a significant shift that might benefit from a more dedicated space for dialogue. What are your thoughts?

If I understood that right, we can certainly move the less-on-topic discussion to emails or the like, whichever you prefer. Hearing your motivations is certainly quite interesting, not to mention that you're the second person related to the Spanish comic/manga industry to shape this project, something I find quite intriguing.

VoxelCubes commented 4 months ago

About the boxes having additional metadata, I can now see where the problem was, if you were intent on avoiding the implications of adding another attribute to them. I don't think that's such a bad thing though, and shouldn't pose too large of an issue, due to box merging and growing being handled by box methods. You'd need to make a decision on which language to prefer when merging two boxes of differing languages, but I expect a simple heuristic that chooses the language for the merged box based on which language the larger box has to be sufficient.

It should suffice to include the language in the #clean.json, but can be dropped in #mask_data.json in lieu of a default value of None, as I doubt it will be useful after OCR has concluded. But it's likely easier to just keep it, not sure. Let me know if the json serialization causes problems, but I doubt it would, considering the StrEnum behaves like a str subclass, even equating to true when compared to a string of the same value.

That should make it simple to include the language in the PageData.visualize() method, which would result in it being displayed in the preprocessor's details view outputs. Since visualize draws multiple boxes, only drawing the language for the merged extended boxes should result in the least clutter.

civvic commented 4 months ago

About the boxes having additional metadata, I can now see where the problem was, if you were intent on avoiding the implications of adding another attribute to them. I don't think that's such a bad thing though, and shouldn't pose too large of an issue, due to box merging and growing being handled by box methods.

Awesome, adding the language attribute straight into the boxes sounds like the way to go. I reckon the merge process might get a bit tricky, but it seems like something we can handle. I'll start tackling that soon (very carefully).

civvic commented 4 months ago

I've made adjustments to manage Tesseract OCR integration more gracefully, reintroducing ocr_use_tesseract as an opt-in feature. This change ensures users are only prompted about Tesseract installation if they explicitly choose to use it, streamlining the user experience and maintaining current OCR behavior by default.

Updated Logic for OCR Engine Selection:

When ocr_use_tesseract is False:
- Manga-ocr is used for all OCR tasks.
- The OCR engine selector is limited to 'auto' or 'manga-ocr', both effectively using manga-ocr.
When ocr_use_tesseract is True:
- ocr_engine set to 'auto':
- Manga-ocr for Japanese.
- Tesseract for English and Unknown languages.
- ocr_engine set to 'manga-ocr':
- Manga-ocr is used for all languages.
- ocr_engine set to 'tesseract':
- Tesseract for English and Unknown.
- Tesseract for Japanese if the language pack is installed; otherwise, falls back to manga-ocr.

This approach allows users to opt into Tesseract usage and select the OCR engine that best fits their needs without unnecessary warnings about Tesseract installation.

VoxelCubes commented 4 months ago

This is looking pretty good now. The uncomfortably hacky interference with the user's profile settings on a widget level was not an approach I'd take. You have already ensured that enabling tesseract gets ignored if it isn't available, so merely warning the user about the issue should suffice. I placed the hook at startup and when applying a new glossary instead, I hope you understand.

civvic commented 4 months ago

Absolutely, no worries at all. GUI programming indeed has its complexities and nuances. Unfortunately, I haven't had the chance to learn Qt or explore the GUI components of PanelCleaner.

As for the next steps, I've managed to test on Mac but haven't yet had the opportunity to test on Linux or Windows, nor have I tackled bundling and related tasks. I'm also unsure how or if this PR impacts those aspects.

VoxelCubes commented 4 months ago

The hard part of bundling will be with getting tesseract into a Linux Flatpak. All of that is handled over here, if you are curious https://github.com/flathub/io.github.voxelcubes.panelcleaner but dealing with that is not something I'd wish upon anyone, I'll suffer through it for you.

civvic commented 4 months ago

Got it, and thanks for the heads-up on the Linux Flatpak bundling fun. Given the potential difficulties and time investment required, I'm inclined to spend a bit more effort on enhancing the value of the PR before tackling the bundling issue. I've got a couple of ideas that shouldn't drag us too deep into GUI territory:

Playing around with Tesseract's accuracy, maybe tweaking the masks or skipping them altogether.
Testing out a newer OCR engine to see how the current setup handles it.

Also, if we're going down the road of bundling the whole Tesseract shebang, not just the PyTesseract wrapper, we might want to peek at tesseocr. It's snappier than PyTesseract, and since we're bundling Tesseract's libs anyway, it could be a neat fit.

How does that sound? If you're okay with this approach, I suggest we postpone the bundling concerns for a few days while I work on these ideas.

VoxelCubes commented 4 months ago

Bundling can wait, of course. For flatpak I'll have to bundle actual tesseract (that means installing it inside the sandbox environment, because a system install isn't accessible from inside the sandbox), but for the other 2 executable formats (windows exe, Linux elf) bundling would be a bad thing. It would prevent users from installing more languages, should that be an option sometime, at least without some very ugly hacks. Being executables means they can still interact with the rest of the system just fine, so no need to bloat our own bundle size (which is limited to 2GB by github).

Right, take your time tweaking things, no need to rush this out. There was also that idea with using the extracted text output for OCR, at least for tesseract since it wasn't trained for that, unlike mangaocr, though maybe it too benefits? You could test that by selecting the extract text output when running the cleaner, then using that instead for OCR. If it then fails to locate the text, you could trick it by first generating the text detection output for an image, then opening the cache folder and replacing the input image in there with the extracted text version (might need a white background added, tesseract might not like transparency). The critical thing here is that the #raw.json still retains the old box information, which won't be regenerated unless explicitly requested.

Oh, one other thing. Tesseract's output will no longer be safe for CSV output files. So instead of the manual string conversion done in 2 places, that would need to use the built-in csv module to properly escape any special characters (most importantly commas and line breaks).

Thanks for everything you're doing, really appreciate it and hope it's enjoyable.

civvic commented 4 months ago

I've made a significant update to the comics_text_detector behavior in commit https://github.com/VoxelCubes/PanelCleaner/pull/85/commits/7aa0305758a7c47ab52c15bdcf802ec8f3470b57, which has shown improved performance on my Mac.
This change is crucial as it directly impacts PanelCleaner's core functionality. Although I've reviewed all related files and found no issues, a thorough check is advisable to ensure no unintended effects arise.

Additionally, this PR introduces a _testbed folder for exploratory programming, primarily containing Jupyter notebooks. To engage with this, you'll need to install Jupyter notebooks and optionally rich. However, due to PanelCleaner's current dependency on Python 11 (because of StrEnum usage), this setup is not compatible with Google Colab as of yet. I plan to explore alternatives when possible.

The first notebook explores Tesseract behavior with respect to text masks and padding. It also introduces a metric to compare OCR performance more objectively (at least for English; the effectiveness for Japanese, considering full-width characters and other specifics, remains uncertain). To experiment with different scenarios, add images to the media/ directory and adjust the base_image and box_idx variables accordingly.

If you prefer not to bother with the notebook setup, you can view them directly on GitHub. If GitHub initially refuses to render the notebook, especially since it contains a lot of images, please persist by refreshing or attempting to open it again.

civvic commented 4 months ago

Now I'll explore the capabilities of Idefics2, as detailed on Hugging Face under the Apache 2.0 license, making it a viable option for integration with PanelCleaner. My initial tests, focusing on cropped boxes in the Idefics2 Playground, have shown very promising performance. Impressively, Idefics2 can be quantized, suggesting it could potentially operate efficiently in various settings, with or without GPU support.

VoxelCubes commented 4 months ago

Just a brief update from me, the github desktop app not authenticating has been a royal pain in the ass. The notebook looks neat, took a while to get it to run in the virtual environment properly. Any reason it has the leading underscore in the directory name though? It's a bit obnoxious. Oh, and as a side note, I'd recommend clearing all cell outputs for proper opsec, otherwise whatever output was generated is also put into the repository, like your file paths.

The idefics sounds pretty interesting, certainly ambitious. I wonder what its limits are.

civvic commented 4 months ago

With my current workflow, nbdev (also in VSCode) takes care of all that; notebooks are just a bunch of JSONs with some metadata, which don't play well with git. In this case, I wasn't sure if you liked or wanted to bother installing Jupyter, so I generated all the outputs in HTML without cleaning them up, so you or anyone else could view the file on GitHub. I'll make it lighter for the next one.

The _testbed folder isn't intended to be included in the PR. It's a private space for exploring ideas more comfortably. That's why the underscore, similar to a Python private variable. Out of curiosity, why does it bother you, something from Linux?

Idefics is wonderful. I tested it with my Linux rig, a 3090 TI 24 Gb. It also worked on my Mac, an old M1. The OCR is spotless, but my tests focus more on fine-tuning and quantization to 4 bits, to determine the minimum GPU requirements. I believe an 8 Gb card would suffice. This week, I'm traveling, so I'm not sure how much progress I'll make. Hopefully, I'll be able to finish the PR next week.

VoxelCubes commented 4 months ago

Ah, you left them in for me to look without needing to run it, very thoughtful, thanks! Clearing outputs is just a good practice to avoid accidentally leaking sensitive info, so it's all right as long as you look out for that.

Oh, if that's what you meant by that directory name, that's all right then.

Quantizing to 4bits, like 4bits per pixel on the input? Efficiency optimization is always interesting. Have safe travels!

civvic commented 4 months ago

No, I meant the model not the images.

Quantizing the model: Idefics-8b has 8 billion parameters, which is substantial compared to OCR models like manga-ocr or Tesseract, but relatively small next to large language models like GPT-4 or Gemini 1.5. Models with 8 billion parameters are designed to be operable on consumer-grade GPUs using float16 or other quantized data types. The memory consumption on the GPU depends on the data type of the parameters:

float32 (4 bytes per parameter): (8,000,000,000 x 4 = 32,000,000,000) bytes or 32 GB. This size necessitates a professional-grade GPU such as the A100 or H100.
float16 (2 bytes per parameter): (8,000,000,000 x 2 = 16,000,000,000) bytes or 16 GB.

These figures represent the raw size of the parameters, do not include additional overhead such as model architecture information, optimizer states, memory alignment, or any applied compression.

If the model is quantized to 4 bits per parameter:

4 bits (0.5 bytes per parameter): (8,000,000,000 x 0.5 = 4,000,000,000) bytes or 4 GB. Supposedly, the model's performance in downstream tasks is not significantly affected, which is what I aim to test for OCR applications.

In addition to the model itself, storage for inputs is required, and the underlying architecture of transformers, like the one used in Idefics, are memory hungry. A relatively large image, say 1200 x 1600 pixels, barely runs on my RTX 3090 using float16. Fortunately, since we only process OCR on small image segments, I haven't encountered significant issues.

Quantizing the Idefics model to 4 bits, if it does not detrimentally affect performance, means it could potentially run on the edge and practically any consumer-grade GPU. The recently released Llama-3-8b is already running on iPhones.

Timings: Loading the Idefics model with float16 on my Nvidia RTX 3090 Ti takes about 8 seconds. Running inference on the boxes takes approximately 1 - 1.1 seconds per box. With quantization to 4 bits, these timings could potentially decrease, offering faster inference times. So, comparable with manga-ocr or Tesseract.

So, well researched technologies like LoRA and QLoRA are promising for fine-tuning and compressing deep learning models efficiently. I'm optimistic about their capabilities, but the real test will be their impact on model efficiency. Theoretically, it should be minimal, but I'm looking forward to seeing how they perform in practice with our OCR tasks.

civvic commented 4 months ago

Moreso, I believe integrating a vision model like Idefics into PanelCleaner could significantly enhance its capabilities. Consider this example from the notebook:

Tesseract's output (best method, dilation 0.2, 95% accuracy):

"EMBOWERED BY GREAT ONARLED\nCYPRESS TREES, THE ANCIENT\nMANOR STANCES ALONE ON THE\nOLTSKIRTS OF NEW ORLEANS,\nKEPT TIDY BY A WHITE-HAIRED\nOLO MAN KNOWN ONLY AS\n\nBAMBL.\n"

After post-processing, we get:

"Embowered by great onarled cypress trees, the ancient manor stances alone on the oltskirts of new orleans, kept tidy by a white-haired olo man known only as bambl."

A human reader instantly knows "NEW ORLEANS" should be capitalized, but traditional OCR or post-processing methods, which operate at the character level, can't discern this. Ideally, an old-school Named Entity Recognition (NER) model could identify such nuances, but that introduces the complexity of managing another model.

In contrast, Idefics (or any similar Large Vision Model, LVM) produces:

"Embowered by great gnarled cypress trees, the ancient manor stands alone on the outskirts of New Orleans, kept tidy by a white-haired old man known only as Bambu."

Spot on! This accuracy is because Idefics is more than just an OCR tool; it's a multi-modal LLM that understands context, including the significance of "New Orleans" as a proper noun that should be capitalized.

Recall your comment about mantra? Mantra's approach is inspired by a well-known 2020 paper by its founders. They do more or less what you do in PanelCleaner, text detection and OCR. The key insight, however, is to leverage the content within a panel to enhance OCR and translation by considering who's speaking, the depicted action, etc. This insight underscores the value of a vision model. While in 2020, such models weren't readily accessible, today we have several options, including Idefics and Llava, which are efficient enough to run on edge devices.

VoxelCubes commented 4 months ago

That's all tremendously interesting, thanks for sharing that! Getting the model down would be good, and maybe even have this be an optional enhancement, in case someone has slow internet, or thinks a few GB are excessive.

I think that kind of model might actually let us properly order bubbles the way a reader would interpret them, no need for imperfect heuristics...though that does only work, if the author didn't make it too confusing for even a human. Though I don't suppose that's quite a possibility yet, due to performance reasons, as you mentioned. But there is now light at the end of the tunnel for this otherwise cursed problem.

VoxelCubes commented 3 months ago

I added a few unrelated hotfixes to master, hence the little merge here to smooth over the few conflicts.

imKota commented 3 months ago

@civvic Will it work on macOS with rx590 8gb video card ?

VoxelCubes commented 3 months ago

I added a new output showing the actual raw, unfiltered box data, including what it thinks the language of each bubble is. Hope it helps in your endeavors, and that I didn't just cause merge conflicts. It also serves as a baseline to see just what gets filtered by the preprocessor.

civvic commented 3 months ago

@civvic Will it work on macOS with rx590 8gb video card?

Not out-of-the-box, I'm afraid. Quantized and fine-tuned? Probably. Check out the Idefics 2 Model card. I’m currently using idefics2-8b with Flash Attention 2 and fp16. It runs without issues on my Ubuntu 22.04.4 with an RTX 3090 TI 24GB, peaking at about 17.5GB of memory usage, as the page indicates. So, when quantized with AWQ or bitsandbytes, it should probably be okay.

Idefics 2 can process one or more high-resolution images directly or by splitting them, but be aware, the vision component of the model is memory-intensive due to its transformer-based architecture. Fortunately, PanelCleaner processes each text box individually, and they're relatively small.

@VoxelCubes Last week I was mostly off the grid and couldn't make much progress. I plan (but you know, programmer timing 🙂) to upload today the Tesseract and Idefics 2 experiments. These are quite interesting, in my opinion, including pages in several languages: English, French, Spanish, and Japanese, along with a framework to handle these and future experiments.

VoxelCubes commented 3 months ago

Fujicrow let me know that claude opus has spectacular OCR capabilities, being able to decipher this:

Could be another alternative to explore, though it isn't possible to run it locally, which isn't optimal.

VoxelCubes commented 3 months ago

Oh, I also heard that pyside6.7.0 doesn't like the compiled resources anymore on macos either, so since I updated them in 0fda6b9 you will probably need to upgrade the PySide6 version installed in your environment.

I also heard that it then works, but several icons don't load, which is really puzzling. Does re-running make compile-qrc from the project root (in the PanelCleaner directory) fix it? Or does that not even work? May need to tweak the RCC_COMPILER path in the makefile. I was really hoping these qt resource bundles wouldn't cause issues and be nice and cross-platform, but for some reason qt6.7.0 is determined to prove me otherwise.

I can implement a workaround that just includes all the files manually, if it's totally broken.

civvic commented 3 months ago

Fujicrow let me know that claude opus has spectacular OCR capabilities, being able to decipher this

Could be another alternative to explore, though it isn't possible to run it locally, which isn't optimal.

I haven't had a chance to explore Claude 3 OCR yet—dealing with a VPN is quite a hassle here in Europe, although I'm a fan of having a long context, which is fantastic. However, in all the experiments, GPT-4V has been almost 100% accurate, significantly outperforming both Tesseract and Idefics. Interestingly, unless explicitly directed otherwise, GPT-4 employs Tesseract for its underlying OCR if possible before its own vision capabilities kick in. The real magic happens during the post-processing phase.

I haven't written the adapter for GPT-4V yet; it’s actually simpler to implement than both Tesseract and Idefics. I want to first complete our evaluation framework before moving on to the next adapter. There are numerous LLMs/VLMs to experiment with, which is why I'm keen on getting our evaluation framework up and running. We could use this framework not only for OCR but also for other new processes if necessary.

Fingers crossed, I should be able to commit the framework by tomorrow evening or Wednesday at the latest. I'm considering splitting this PR into two parts: one for the current updates without the evaluation stuff, and another for what I've been working on these past few days. This approach will allow us to proceed with the Tesseract integration as it stands now and address the Idefics integration and, most importantly, the evaluation framework (think a much improved what's now in the _testbed folder) in a separate PR. Some new goodies: running the notebooks with PanelCleaner on Colab and Idefics already having a brand new Mac silicon version.

Oh, I also heard that pyside6.7.0 doesn't like the compiled resources anymore on macos either, so since I updated them in 0fda6b9 you will probably need to upgrade the PySide6 version installed in your environment.

That's unfortunate! I'll double-check this after updating my OS. I tested the GUI just last week and everything was running smoothly. It might be an issue specific to Sonoma; I'm on Ventura yet. I generally prefer to wait a while before updating.

VoxelCubes commented 3 months ago

Oh, yes, a proper environment to get empirical or at least repeatable test data is invaluable to determining what direction to go with these vision models. Very exciting! Splitting up the pr could he a good idea, to adapt to the changing focus.

The pyside update isn't realted to your OS, just whatever version of the python package is installed. It will only update by doing pip install PySide6 --upgrade.

civvic commented 3 months ago

Removed the _testbed folder. This PR has been reverted to its state from weeks ago, and we are now ready to proceed with the merge if agreed. Idefics exploration and the evaluation framework will be addressed in a separate PR later today.

VoxelCubes commented 3 months ago

Cool and good, I'll give this a final review then.

VoxelCubes commented 3 months ago

Looks good to me! I'll merge this evening if nothing else comes up.