RedFantom / simple-ocr-opencv

A library for simple OCR in Python using OpenCV
GNU Affero General Public License v3.0

Questions regarding training of coloured images? #11

Closed mit456 closed 6 years ago

mit456 commented 6 years ago

Hello @RedFantom ,

Thanks for your reply on fixing the numpy deprecation error. I went through the code structure and understood the flow a bit. I tried example_grounding.py with an image (attached here) and was able to build the box file corresponding to the image file, but I have a couple of questions.

  1. Some of the segments were not detected. How do you add such a segment to the box file?
  2. Does BLANK_CLASS indicate the (space) class? If not, how do I add (space) to the `allowed_chars` list? Basically, I want to detect multiple lines of text from an image.
  3. How do I train on a big dataset? I think I can use ocr.train to train the classifier, but does it need support for passing multiple files?
  4. Can we use virtualenv for the project to manage pip package versions?


RedFantom commented 6 years ago

It's been a while since I worked on this code, but I will do my best to answer your questions.

Question 1 While you can edit box files manually, I would not really advise it. The real issue is that you are using an image file in which the text is not fully separated from the rest of the image. Because this library uses feature matching, it is pretty sensitive to disturbances in contrast levels, even though rotation and other manipulations have less effect. The library attempts to detect strong contrast differences and build segments based on that (if I remember correctly).

If you were to add a character to the box file manually (it's in the format `character x y w h 0`; I can't remember what the trailing 0 was for), the classifier would actually be trained with the specific contrast differences of the character you've given it, possibly messing up your model if you do it often enough and don't provide enough 'clean' characters. It's better to train with images that are as clean as possible (black and white, no grayscales, preferably large letters clearly and fully separated along a single line) and then try to perform OCR on the more difficult images, rather than the other way around.
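To make the box-file format mentioned above concrete, here is a minimal sketch of reading and writing a single line of the form `character x y w h 0`. The names `Box`, `parse_box_line` and `format_box_line` are hypothetical helpers for illustration, not part of the library, and the code assumes the fields are separated by single spaces and that the coordinates are integers. The meaning of the final field is not stated in this thread, so it is simply carried through unchanged.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One box-file entry (hypothetical helper, not the library's own class)."""
    char: str
    x: int
    y: int
    w: int
    h: int
    trailing: int  # meaning not stated in this thread; preserved as-is


def parse_box_line(line: str) -> Box:
    # Split from the right so that a character that is itself a space survives.
    # Assumption: fields are separated by exactly one space.
    char, x, y, w, h, trailing = line.rstrip("\r\n").rsplit(" ", 5)
    return Box(char, int(x), int(y), int(w), int(h), int(trailing))


def format_box_line(box: Box) -> str:
    # Inverse of parse_box_line, without a trailing newline.
    return f"{box.char} {box.x} {box.y} {box.w} {box.h} {box.trailing}"
```

A line parsed this way can be written back with `format_box_line(parse_box_line(line))`, which is convenient if you only want to inspect or filter entries rather than add hand-made ones.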

Question 2 Yes, BLANK_CLASS is valid for spaces, but note that it also represents any other whitespace (except newline, those are not classified), so also \t and some others. As long as the image can be segmented, processing multiple lines from a single image should not be an issue. I have never tried with characters with different colors or anything like that, though.

Question 3 If you want to train on a bigger dataset, then you should loop over your files and train on each and every one of them, one at a time. Note that each file will have to be grounded individually first, which, depending on the method you're using, might take a long time. Using UserGrounder will be most accurate (because you can dismiss non-character segments), but if your images are perfectly segmentable you can use TextGrounder as well, provided you have some sort of dictionary of the text contents for your training set.
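The loop described above could be sketched roughly as follows. This is only an outline under stated assumptions: `train_on_files`, `ground` and `train` are hypothetical names, and the two callables are placeholders you would fill in yourself with whatever grounder you use (for example UserGrounder or TextGrounder) and with ocr.train. The exact signatures of those library calls are not shown in this thread, so they are deliberately not spelled out here.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def train_on_files(
    paths: Iterable[Path],
    ground: Callable[[Path], object],
    train: Callable[[object], None],
) -> List[Path]:
    """Ground and train on each file, one at a time.

    `ground` and `train` are placeholders supplied by the caller; they are
    assumptions for illustration, not the library's actual API.
    """
    trained: List[Path] = []
    for path in sorted(paths):
        grounded = ground(path)  # each file must be grounded individually first
        train(grounded)          # then train the classifier on that one file
        trained.append(path)
    return trained
```

Passing the two steps in as callables keeps the loop independent of the grounding method, so switching from interactive grounding to text-based grounding only changes the `ground` argument.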

Question 4 Sure, I guess. I never use virtualenv. All my dependencies are working perfectly fine from the system-wide Python installation, so I just don't bother using it, to be honest.

General remark While I have put some work into this library to make it more usable as a library (and for me, it has served its purpose), I must point out that Goncalo first built this as a learning experience and educational tool more than an actual library (he emphasized this in one of my PRs, I think). If you are looking for a high-accuracy, easy-to-use, flexible OCR system, I can recommend Tesseract. It's complicated, sure, but its results are amazing, even at default settings. It just requires a lot more computational power.

If you have any more questions, feel free to ask them! It might take a day, or two, but I'll answer them all. If you don't have any more questions, then I'd appreciate it if you closed the issue, so this (basically inactive) repository stays clean.

mit456 commented 6 years ago

Thanks for your valuable comments.