HusseinYoussef / Arabic-OCR

OCR system for Arabic language that converts images of typed text to machine-encoded text.
MIT License
275 stars 73 forks source link
arabic character-segmentation computer-vision dataset image-processing machine-learning neural-network ocr opencv-python scikit-learn segmentation

Arabic OCR

GitHub contributors GitHub forks GitHub stars GitHub license

Important Note

The system currently supports only letters (29 letters) ا-ى , لا (no numbers or special symbols).

Setup

Install python then run this command:

pip install -r requirements.txt

Run

  1. Put the images in src/test directory
  2. Go to src directory and run the following command
    python OCR.py
  3. Output folder will be created with:
    • text folder which has text files corresponding to the images.
    • running_time file which has the time taken to process each image.

Pipeline

Pipeline

Dataset

Examples

Line Segmentation

Line

Word Segmentation

Word

Character Segmentation

Word Word Word Word

Testing

NOTE: Make sure you have a folder with the truth output with same file names to compare it with the predicted text.

From within src folder run:

python edit.py 'output/text' 'truth'

Test

Performance


NOTE

We achieved these results when we used only the flatten image as feature.


References

  1. An Efficient, Font Independent Word and Character Segmentation Algorithm for Printed Arabic Text.

  2. A Robust Line Segmentation Algorithm for Arabic Printed Text with Diacritics.

  3. Arabic Character Segmentation Using Projection Based Approach with Profile's Amplitude Filter .