JaidedAI / EasyOCR

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
https://www.jaided.ai
Apache License 2.0

EasyOCR VS Tesseract #285

Closed by hahmad2008 3 years ago

hahmad2008 commented 4 years ago

EasyOCR is not only for scanned images, is it? I ask because, as far as I know, Tesseract needs images that are not scans to be pre-processed so that they look like scanned images before it performs well.

GokulNC commented 3 years ago

EasyOCR uses this code to generate its dataset and trains on it: https://github.com/Belval/TextRecognitionDataGenerator

From what I can guess, EasyOCR leans more towards scanned images because of the above. This is also similar to how Tesseract generates synthetic data.

Basically, we can split text recognition into 2 classes:

  1. Optical Character Recognition (OCR): optimized for documents and similar inputs.
    • TRDG is an example of an OCR dataset generator.
  2. Scene Text Recognition (STR): optimized for free-form images.
    • SynthText is an example of an STR dataset generator.

So both EasyOCR & Tesseract fall under OCR, I believe. Deciding which one is better is up to your own experiments. From what I've experimented with, I can qualitatively say that EasyOCR's recognition models are somewhat better than Tesseract's recognition models (but not drastically).

Note that I am not talking about the detection part. The EasyOCR library uses the CRAFT model for detection, which is DL-based and hence clearly better than Tesseract's current classical, page-segmentation-based text detection.
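
To make the detection + recognition split concrete, here is a minimal sketch of calling EasyOCR end to end. It assumes `easyocr` is installed and that `readtext()` returns a list of `(bounding_box, text, confidence)` tuples; the helper name `filter_confident` and the 0.5 cut-off are illustrative choices of mine, not part of the library.

```python
def filter_confident(results, min_conf=0.5):
    """Keep (text, confidence) pairs at or above min_conf.

    `results` is assumed to have the shape EasyOCR's readtext() returns:
    a list of (bounding_box, text, confidence) tuples.
    `min_conf` is a hypothetical cut-off; tune it on your own images.
    """
    return [(text, conf) for _box, text, conf in results if conf >= min_conf]


def read_scene_text(image_path, languages=("en",)):
    """Detect and recognize text in one image, dropping low-confidence hits."""
    # Imported lazily so filter_confident() works without EasyOCR installed.
    import easyocr

    reader = easyocr.Reader(list(languages))  # loads detection + recognition models
    return filter_confident(reader.readtext(image_path))
```

Something like `read_scene_text("street_sign.jpg")` (a made-up file name) would then give you the recognized strings with their confidences.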

ghandic commented 3 years ago

Tesseract fails on scenes because it does not know how to binarize the image. Using DB/CRAFT + Tesseract works pretty well and is a good fit for CPU when you don't know what your incoming images will look like.
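
For anyone unsure what "binarize" means here, below is a rough pure-Python sketch of Otsu's method, a common global thresholding technique. It is only an illustration of the idea, not a claim about what any particular library does internally; `otsu_threshold` and `binarize` are names I made up, and the input is assumed to be a flat list of grey levels in 0..255.

```python
def otsu_threshold(pixels):
    """Return the threshold t that maximises between-class variance.

    `pixels` is a flat iterable of integer grey levels in 0..255.
    Pixels with value <= t form class 0, the rest form class 1.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0 so far
    sum0 = 0  # sum of grey levels in class 0 so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break  # class 1 is empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to 0 (at or below threshold) or 255 (above it)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A single global threshold like this tends to work on evenly lit scans and to break down on scene images with uneven lighting or busy backgrounds, which is why running a detector first and handing small crops to Tesseract helps.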

ColonelThirtyTwo commented 3 years ago

More generally, as a developer who just wants to OCR stuff, what makes this library different from Tesseract or other OCR solutions? Why should I use this library? Where does it excel?

ghandic commented 3 years ago

@ColonelThirtyTwo this repo is more for scene text or general text extraction. Tesseract on its own only really works well with well-formatted and aligned documents, or with small regions extracted from scene text that do not use cursive or odd fonts.
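
The "small regions extracted from scene text" workflow can be sketched roughly as follows. This assumes a detector has already given you axis-aligned boxes as `(x_min, y_min, x_max, y_max)` with exclusive maxima, and that the image is a row-major 2D list; `crop`, `recognize_regions` and the `recognize` callable are hypothetical names used only for illustration.

```python
def crop(image, box):
    """Cut a rectangular region out of a row-major 2D image (list of rows).

    `box` is (x_min, y_min, x_max, y_max); the max coordinates are exclusive.
    """
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def recognize_regions(image, boxes, recognize):
    """Run `recognize` on each detected region and return the texts in box order.

    `recognize` is any callable that takes a cropped region and returns a
    string, so the recognizer can be swapped without touching this function.
    """
    return [recognize(crop(image, box)) for box in boxes]
```

In a real pipeline the boxes would come from a detector such as CRAFT or DB, and `recognize` could wrap something like `pytesseract.image_to_string(region, config="--psm 7")` (page segmentation mode 7 treats the input as a single text line) after converting the region to a PIL image or NumPy array.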

ColonelThirtyTwo commented 3 years ago

So basically, based on @ghandic's and @GokulNC's comments, Tesseract works well for scanned print documents, whereas EasyOCR works well for extracting text from general scenes / random pictures. Is that right?

That's the kind of info that I would like to know when learning about projects, so that I can see if it is appropriate for me to use it or not. I recommend putting that on your website.